Time-Series Forecasting for Walmart Dataset¶

Data Section¶

In [484]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import plotly.express as px
import plotly.graph_objects as go
from statsmodels.tsa.seasonal import seasonal_decompose
%matplotlib inline

from statsmodels.tsa.arima_model import ARMA,ARMAResults,ARIMA,ARIMAResults
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf # for determining (p,q) orders
from pmdarima import auto_arima # for determining ARIMA orders
from statsmodels.tsa.statespace.sarimax import SARIMAX

from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.metrics import r2_score,mean_squared_error
import xgboost as xgb

# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")

# Display floats with two decimal places
# pd.set_option('display.float_format', '{:.2f}'.format)
In [7]:
# !pip install pmdarima
In [8]:
# !pip install xgboost
In [9]:
calendar_df = pd.read_csv('C:/Users/97610/Downloads/m5-forecasting-accuracy/calendar.csv')
In [10]:
calendar_df.info
Out[10]:
<bound method DataFrame.info of             date  wm_yr_wk    weekday  wday  month  year       d  \
0     2011-01-29     11101   Saturday     1      1  2011     d_1   
1     2011-01-30     11101     Sunday     2      1  2011     d_2   
2     2011-01-31     11101     Monday     3      1  2011     d_3   
3     2011-02-01     11101    Tuesday     4      2  2011     d_4   
4     2011-02-02     11101  Wednesday     5      2  2011     d_5   
...          ...       ...        ...   ...    ...   ...     ...   
1964  2016-06-15     11620  Wednesday     5      6  2016  d_1965   
1965  2016-06-16     11620   Thursday     6      6  2016  d_1966   
1966  2016-06-17     11620     Friday     7      6  2016  d_1967   
1967  2016-06-18     11621   Saturday     1      6  2016  d_1968   
1968  2016-06-19     11621     Sunday     2      6  2016  d_1969   

      event_name_1 event_type_1  event_name_2 event_type_2  snap_CA  snap_TX  \
0              NaN          NaN           NaN          NaN        0        0   
1              NaN          NaN           NaN          NaN        0        0   
2              NaN          NaN           NaN          NaN        0        0   
3              NaN          NaN           NaN          NaN        1        1   
4              NaN          NaN           NaN          NaN        1        0   
...            ...          ...           ...          ...      ...      ...   
1964           NaN          NaN           NaN          NaN        0        1   
1965           NaN          NaN           NaN          NaN        0        0   
1966           NaN          NaN           NaN          NaN        0        0   
1967           NaN          NaN           NaN          NaN        0        0   
1968  NBAFinalsEnd     Sporting  Father's day     Cultural        0        0   

      snap_WI  
0           0  
1           0  
2           0  
3           0  
4           1  
...       ...  
1964        1  
1965        0  
1966        0  
1967        0  
1968        0  

[1969 rows x 14 columns]>
In [11]:
calendar_df.head()
Out[11]:
date wm_yr_wk weekday wday month year d event_name_1 event_type_1 event_name_2 event_type_2 snap_CA snap_TX snap_WI
0 2011-01-29 11101 Saturday 1 1 2011 d_1 NaN NaN NaN NaN 0 0 0
1 2011-01-30 11101 Sunday 2 1 2011 d_2 NaN NaN NaN NaN 0 0 0
2 2011-01-31 11101 Monday 3 1 2011 d_3 NaN NaN NaN NaN 0 0 0
3 2011-02-01 11101 Tuesday 4 2 2011 d_4 NaN NaN NaN NaN 1 1 0
4 2011-02-02 11101 Wednesday 5 2 2011 d_5 NaN NaN NaN NaN 1 0 1
In [12]:
calendar_df.tail()
Out[12]:
date wm_yr_wk weekday wday month year d event_name_1 event_type_1 event_name_2 event_type_2 snap_CA snap_TX snap_WI
1964 2016-06-15 11620 Wednesday 5 6 2016 d_1965 NaN NaN NaN NaN 0 1 1
1965 2016-06-16 11620 Thursday 6 6 2016 d_1966 NaN NaN NaN NaN 0 0 0
1966 2016-06-17 11620 Friday 7 6 2016 d_1967 NaN NaN NaN NaN 0 0 0
1967 2016-06-18 11621 Saturday 1 6 2016 d_1968 NaN NaN NaN NaN 0 0 0
1968 2016-06-19 11621 Sunday 2 6 2016 d_1969 NBAFinalsEnd Sporting Father's day Cultural 0 0 0
In [13]:
calendar_df["date"] = pd.to_datetime(calendar_df["date"])
In [14]:
total_days = len(calendar_df)
event_days = calendar_df[['event_name_1', 'event_name_2']].dropna(how='all').shape[0]

event_percentage = (event_days / total_days) * 100
no_event_percentage = 100 - event_percentage

labels = ['Days with Events', 'Days without Events']
sizes = [event_percentage, no_event_percentage]

plt.figure(figsize=(6, 6))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, wedgeprops={'edgecolor': 'black'})
plt.title('Percentage of Days with Events')
plt.show()
No description has been provided for this image
In [15]:
event_counts = pd.concat([calendar_df['event_name_1'], calendar_df['event_name_2']]).value_counts()
print("Event Name Counts:\n", event_counts)
Event Name Counts:
 SuperBowl              6
Cinco De Mayo          6
Ramadan starts         6
ValentinesDay          6
Father's day           6
NBAFinalsEnd           6
NBAFinalsStart         6
MemorialDay            6
Mother's day           6
Easter                 6
Pesach End             6
Purim End              6
StPatricksDay          6
LentWeek2              6
LentStart              6
PresidentsDay          6
OrthodoxEaster         6
Thanksgiving           5
MartinLutherKingDay    5
OrthodoxChristmas      5
NewYear                5
Chanukah End           5
Christmas              5
Halloween              5
VeteransDay            5
EidAlAdha              5
ColumbusDay            5
LaborDay               5
Eid al-Fitr            5
IndependenceDay        5
Name: count, dtype: int64
In [16]:
calendar_df[['year', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']].groupby("year").count()
Out[16]:
event_name_1 event_type_1 event_name_2 event_type_2
year
2011 26 26 1 1
2012 30 30 0 0
2013 29 29 1 1
2014 28 28 2 2
2015 30 30 0 0
2016 19 19 1 1
In [17]:
event_type_counts = pd.concat([calendar_df['event_type_1'], calendar_df['event_type_2']]).value_counts()
print("Event Type Counts:\n", event_type_counts)
Event Type Counts:
 Religious    56
National     52
Cultural     41
Sporting     18
Name: count, dtype: int64
In [18]:
unique_event_types = event_type_counts.index
colors = sns.color_palette("Paired", len(unique_event_types))

plt.figure(figsize=(10, 6))
event_type_counts.plot(kind='bar', color=colors, edgecolor='black')
plt.xlabel("Event Type")
plt.ylabel("Count")
plt.title("Distribution of Each Event Category")
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
No description has been provided for this image
In [19]:
calendar_df['snap_CA'] = calendar_df['snap_CA'].astype(bool)
calendar_df['snap_TX'] = calendar_df['snap_TX'].astype(bool)
calendar_df['snap_WI'] = calendar_df['snap_WI'].astype(bool)

snap_melted = calendar_df.melt(value_vars=['snap_CA', 'snap_TX', 'snap_WI'], 
                               var_name='State', 
                               value_name='SNAP_Purchase')

snap_counts = snap_melted.groupby(['State', 'SNAP_Purchase']).size().reset_index(name='Count')
snap_counts['Percentage'] = snap_counts.groupby('State')['Count'].transform(lambda x: x / x.sum() * 100)

g = sns.catplot(data=snap_counts, x='SNAP_Purchase', y='Percentage', col='State', kind='bar',palette="Paired", 
                col_order=['snap_CA', 'snap_TX', 'snap_WI'], height=3, aspect=1)
g.set_axis_labels("", "%")
g.set_titles(col_template="{col_name}")
plt.suptitle("Days with SNAP purchases for all states", fontsize=12)
plt.show()
No description has been provided for this image
In [20]:
sales_df = pd.read_csv('C:/Users/97610/Downloads/m5-forecasting-accuracy/sales_train_evaluation.csv')
In [21]:
sales_df.info
Out[21]:
<bound method DataFrame.info of                                   id        item_id    dept_id   cat_id  \
0      HOBBIES_1_001_CA_1_evaluation  HOBBIES_1_001  HOBBIES_1  HOBBIES   
1      HOBBIES_1_002_CA_1_evaluation  HOBBIES_1_002  HOBBIES_1  HOBBIES   
2      HOBBIES_1_003_CA_1_evaluation  HOBBIES_1_003  HOBBIES_1  HOBBIES   
3      HOBBIES_1_004_CA_1_evaluation  HOBBIES_1_004  HOBBIES_1  HOBBIES   
4      HOBBIES_1_005_CA_1_evaluation  HOBBIES_1_005  HOBBIES_1  HOBBIES   
...                              ...            ...        ...      ...   
30485    FOODS_3_823_WI_3_evaluation    FOODS_3_823    FOODS_3    FOODS   
30486    FOODS_3_824_WI_3_evaluation    FOODS_3_824    FOODS_3    FOODS   
30487    FOODS_3_825_WI_3_evaluation    FOODS_3_825    FOODS_3    FOODS   
30488    FOODS_3_826_WI_3_evaluation    FOODS_3_826    FOODS_3    FOODS   
30489    FOODS_3_827_WI_3_evaluation    FOODS_3_827    FOODS_3    FOODS   

      store_id state_id  d_1  d_2  d_3  d_4  ...  d_1932  d_1933  d_1934  \
0         CA_1       CA    0    0    0    0  ...       2       4       0   
1         CA_1       CA    0    0    0    0  ...       0       1       2   
2         CA_1       CA    0    0    0    0  ...       1       0       2   
3         CA_1       CA    0    0    0    0  ...       1       1       0   
4         CA_1       CA    0    0    0    0  ...       0       0       0   
...        ...      ...  ...  ...  ...  ...  ...     ...     ...     ...   
30485     WI_3       WI    0    0    2    2  ...       1       0       3   
30486     WI_3       WI    0    0    0    0  ...       0       0       0   
30487     WI_3       WI    0    6    0    2  ...       0       0       1   
30488     WI_3       WI    0    0    0    0  ...       1       1       1   
30489     WI_3       WI    0    0    0    0  ...       1       2       0   

       d_1935  d_1936  d_1937  d_1938  d_1939  d_1940  d_1941  
0           0       0       0       3       3       0       1  
1           1       1       0       0       0       0       0  
2           0       0       0       2       3       0       1  
3           4       0       1       3       0       2       6  
4           2       1       0       0       2       1       0  
...       ...     ...     ...     ...     ...     ...     ...  
30485       0       1       1       0       0       1       1  
30486       0       0       0       1       0       1       0  
30487       2       0       1       0       1       0       2  
30488       4       6       0       1       1       1       0  
30489       5       4       0       2       2       5       1  

[30490 rows x 1947 columns]>
In [22]:
sales_df.head()
Out[22]:
id item_id dept_id cat_id store_id state_id d_1 d_2 d_3 d_4 ... d_1932 d_1933 d_1934 d_1935 d_1936 d_1937 d_1938 d_1939 d_1940 d_1941
0 HOBBIES_1_001_CA_1_evaluation HOBBIES_1_001 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0 ... 2 4 0 0 0 0 3 3 0 1
1 HOBBIES_1_002_CA_1_evaluation HOBBIES_1_002 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0 ... 0 1 2 1 1 0 0 0 0 0
2 HOBBIES_1_003_CA_1_evaluation HOBBIES_1_003 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0 ... 1 0 2 0 0 0 2 3 0 1
3 HOBBIES_1_004_CA_1_evaluation HOBBIES_1_004 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0 ... 1 1 0 4 0 1 3 0 2 6
4 HOBBIES_1_005_CA_1_evaluation HOBBIES_1_005 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0 ... 0 0 0 2 1 0 0 2 1 0

5 rows × 1947 columns

In [23]:
sales_df.tail()
Out[23]:
id item_id dept_id cat_id store_id state_id d_1 d_2 d_3 d_4 ... d_1932 d_1933 d_1934 d_1935 d_1936 d_1937 d_1938 d_1939 d_1940 d_1941
30485 FOODS_3_823_WI_3_evaluation FOODS_3_823 FOODS_3 FOODS WI_3 WI 0 0 2 2 ... 1 0 3 0 1 1 0 0 1 1
30486 FOODS_3_824_WI_3_evaluation FOODS_3_824 FOODS_3 FOODS WI_3 WI 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
30487 FOODS_3_825_WI_3_evaluation FOODS_3_825 FOODS_3 FOODS WI_3 WI 0 6 0 2 ... 0 0 1 2 0 1 0 1 0 2
30488 FOODS_3_826_WI_3_evaluation FOODS_3_826 FOODS_3 FOODS WI_3 WI 0 0 0 0 ... 1 1 1 4 6 0 1 1 1 0
30489 FOODS_3_827_WI_3_evaluation FOODS_3_827 FOODS_3 FOODS WI_3 WI 0 0 0 0 ... 1 2 0 5 4 0 2 2 5 1

5 rows × 1947 columns

In [24]:
cols = sales_df.dtypes.index.tolist()
types = sales_df.dtypes.values.tolist()
sales_df.dtypes.index, sales_df.dtypes.values
Out[24]:
(Index(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'd_1',
        'd_2', 'd_3', 'd_4',
        ...
        'd_1932', 'd_1933', 'd_1934', 'd_1935', 'd_1936', 'd_1937', 'd_1938',
        'd_1939', 'd_1940', 'd_1941'],
       dtype='object', length=1947),
 array([dtype('O'), dtype('O'), dtype('O'), ..., dtype('int64'),
        dtype('int64'), dtype('int64')], dtype=object))
In [25]:
def downcast(df):
    cols = df.dtypes.index.tolist()
    types = df.dtypes.values.tolist()
    for i, t in enumerate(types):
        if 'int' in str(t):
            if df[cols[i]].min() > np.iinfo(np.int8).min and df[cols[i]].max() < np.iinfo(np.int8).max:
                df[cols[i]] = df[cols[i]].astype(np.int8)
            elif df[cols[i]].min() > np.iinfo(np.int16).min and df[cols[i]].max() < np.iinfo(np.int16).max:
                df[cols[i]] = df[cols[i]].astype(np.int16)
            elif df[cols[i]].min() > np.iinfo(np.int32).min and df[cols[i]].max() < np.iinfo(np.int32).max:
                df[cols[i]] = df[cols[i]].astype(np.int32)
            else:
                df[cols[i]] = df[cols[i]].astype(np.int64)
        elif 'float' in str(t):
            if df[cols[i]].min() > np.finfo(np.float16).min and df[cols[i]].max() < np.finfo(np.float16).max:
                df[cols[i]] = df[cols[i]].astype(np.float16)
            elif df[cols[i]].min() > np.finfo(np.float32).min and df[cols[i]].max() < np.finfo(np.float32).max:
                df[cols[i]] = df[cols[i]].astype(np.float32)
            else:
                df[cols[i]] = df[cols[i]].astype(np.float64)
                
        elif t == object:
            if cols[i] == 'date':
                df[cols[i]] = pd.to_datetime(df[cols[i]], format='%Y-%m-%d')
            else:
                df[cols[i]] = df[cols[i]].astype('category')
    return df
In [26]:
sales_df = downcast(sales_df)
In [27]:
prices = pd.read_csv('C:/Users/97610/Downloads/m5-forecasting-accuracy/sell_prices.csv')
In [28]:
prices.head()
Out[28]:
store_id item_id wm_yr_wk sell_price
0 CA_1 HOBBIES_1_001 11325 9.58
1 CA_1 HOBBIES_1_001 11326 9.58
2 CA_1 HOBBIES_1_001 11327 8.26
3 CA_1 HOBBIES_1_001 11328 8.26
4 CA_1 HOBBIES_1_001 11329 8.26
In [29]:
prices.tail()
Out[29]:
store_id item_id wm_yr_wk sell_price
6841116 WI_3 FOODS_3_827 11617 1.0
6841117 WI_3 FOODS_3_827 11618 1.0
6841118 WI_3 FOODS_3_827 11619 1.0
6841119 WI_3 FOODS_3_827 11620 1.0
6841120 WI_3 FOODS_3_827 11621 1.0
In [30]:
prices = downcast(prices)
In [31]:
df = pd.melt(sales_df, id_vars=['id','item_id','dept_id','cat_id','store_id','state_id'],
        var_name='d',value_name='unit_sold').dropna()
df.head()
Out[31]:
id item_id dept_id cat_id store_id state_id d unit_sold
0 HOBBIES_1_001_CA_1_evaluation HOBBIES_1_001 HOBBIES_1 HOBBIES CA_1 CA d_1 0
1 HOBBIES_1_002_CA_1_evaluation HOBBIES_1_002 HOBBIES_1 HOBBIES CA_1 CA d_1 0
2 HOBBIES_1_003_CA_1_evaluation HOBBIES_1_003 HOBBIES_1 HOBBIES CA_1 CA d_1 0
3 HOBBIES_1_004_CA_1_evaluation HOBBIES_1_004 HOBBIES_1 HOBBIES CA_1 CA d_1 0
4 HOBBIES_1_005_CA_1_evaluation HOBBIES_1_005 HOBBIES_1 HOBBIES CA_1 CA d_1 0
In [32]:
df = pd.merge(df, calendar_df, on='d', how='left')
df.head()
Out[32]:
id item_id dept_id cat_id store_id state_id d unit_sold date wm_yr_wk ... wday month year event_name_1 event_type_1 event_name_2 event_type_2 snap_CA snap_TX snap_WI
0 HOBBIES_1_001_CA_1_evaluation HOBBIES_1_001 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 1 2011 NaN NaN NaN NaN False False False
1 HOBBIES_1_002_CA_1_evaluation HOBBIES_1_002 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 1 2011 NaN NaN NaN NaN False False False
2 HOBBIES_1_003_CA_1_evaluation HOBBIES_1_003 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 1 2011 NaN NaN NaN NaN False False False
3 HOBBIES_1_004_CA_1_evaluation HOBBIES_1_004 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 1 2011 NaN NaN NaN NaN False False False
4 HOBBIES_1_005_CA_1_evaluation HOBBIES_1_005 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 1 2011 NaN NaN NaN NaN False False False

5 rows × 21 columns

In [33]:
df.head()
Out[33]:
id item_id dept_id cat_id store_id state_id d unit_sold date wm_yr_wk ... wday month year event_name_1 event_type_1 event_name_2 event_type_2 snap_CA snap_TX snap_WI
0 HOBBIES_1_001_CA_1_evaluation HOBBIES_1_001 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 1 2011 NaN NaN NaN NaN False False False
1 HOBBIES_1_002_CA_1_evaluation HOBBIES_1_002 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 1 2011 NaN NaN NaN NaN False False False
2 HOBBIES_1_003_CA_1_evaluation HOBBIES_1_003 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 1 2011 NaN NaN NaN NaN False False False
3 HOBBIES_1_004_CA_1_evaluation HOBBIES_1_004 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 1 2011 NaN NaN NaN NaN False False False
4 HOBBIES_1_005_CA_1_evaluation HOBBIES_1_005 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 1 2011 NaN NaN NaN NaN False False False

5 rows × 21 columns

In [36]:
df = pd.merge(df, prices, on=['store_id','item_id','wm_yr_wk'], how='left')
df.head()
Out[36]:
id item_id dept_id cat_id store_id state_id d unit_sold date wm_yr_wk ... month year event_name_1 event_type_1 event_name_2 event_type_2 snap_CA snap_TX snap_WI sell_price
0 HOBBIES_1_001_CA_1_evaluation HOBBIES_1_001 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 2011 NaN NaN NaN NaN False False False NaN
1 HOBBIES_1_002_CA_1_evaluation HOBBIES_1_002 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 2011 NaN NaN NaN NaN False False False NaN
2 HOBBIES_1_003_CA_1_evaluation HOBBIES_1_003 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 2011 NaN NaN NaN NaN False False False NaN
3 HOBBIES_1_004_CA_1_evaluation HOBBIES_1_004 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 2011 NaN NaN NaN NaN False False False NaN
4 HOBBIES_1_005_CA_1_evaluation HOBBIES_1_005 HOBBIES_1 HOBBIES CA_1 CA d_1 0 2011-01-29 11101 ... 1 2011 NaN NaN NaN NaN False False False NaN

5 rows × 22 columns

In [37]:
df_grouped = df.groupby("dept_id")["unit_sold"].sum().reset_index()
print(df_grouped)
       dept_id  unit_sold
0      FOODS_1    5190400
1      FOODS_2    7795025
2      FOODS_3   32937002
3    HOBBIES_1    5699014
4    HOBBIES_2     541642
5  HOUSEHOLD_1   11722853
6  HOUSEHOLD_2    3041237
In [38]:
plt.figure(figsize=(12, 6))
plt.bar(df_grouped["dept_id"], df_grouped["unit_sold"], color='skyblue', edgecolor='black', alpha=0.7)
plt.xlabel("Department ID")
plt.ylabel("Total Units Sold")
plt.title("Total Units Sold per Department")
plt.xticks(rotation=0)  
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
No description has been provided for this image
In [39]:
df["sell_price"] = df["sell_price"].fillna(0)
In [40]:
df["revenue"] = df["unit_sold"] * df["sell_price"]
df_grouped_revenue = df.groupby("dept_id")["revenue"].sum().reset_index()

plt.figure(figsize=(12, 6))
plt.bar(df_grouped_revenue["dept_id"], df_grouped_revenue["revenue"], color='lightcoral', edgecolor='black', alpha=0.7)
plt.xlabel("Department ID")
plt.ylabel("Total Revenue")
plt.title("Total Revenue per Department")
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
No description has been provided for this image
In [41]:
Foods3 = df[df['dept_id'] == 'FOODS_3'].copy()
In [42]:
Foods3.head()
Out[42]:
id item_id dept_id cat_id store_id state_id d unit_sold date wm_yr_wk ... year event_name_1 event_type_1 event_name_2 event_type_2 snap_CA snap_TX snap_WI sell_price revenue
2226 FOODS_3_001_CA_1_evaluation FOODS_3_001 FOODS_3 FOODS CA_1 CA d_1 1 2011-01-29 11101 ... 2011 NaN NaN NaN NaN False False False 2.279297 2.279297
2227 FOODS_3_002_CA_1_evaluation FOODS_3_002 FOODS_3 FOODS CA_1 CA d_1 0 2011-01-29 11101 ... 2011 NaN NaN NaN NaN False False False 0.000000 0.000000
2228 FOODS_3_003_CA_1_evaluation FOODS_3_003 FOODS_3 FOODS CA_1 CA d_1 0 2011-01-29 11101 ... 2011 NaN NaN NaN NaN False False False 0.000000 0.000000
2229 FOODS_3_004_CA_1_evaluation FOODS_3_004 FOODS_3 FOODS CA_1 CA d_1 0 2011-01-29 11101 ... 2011 NaN NaN NaN NaN False False False 0.000000 0.000000
2230 FOODS_3_005_CA_1_evaluation FOODS_3_005 FOODS_3 FOODS CA_1 CA d_1 1 2011-01-29 11101 ... 2011 NaN NaN NaN NaN False False False 1.980469 1.980469

5 rows × 23 columns

In [43]:
sns.boxplot(data=Foods3, x='state_id', y='unit_sold')
plt.xlabel('State ID')
plt.ylabel('Units Sold')
plt.title('Boxplot of Foods3 Units Sold by State')
plt.show()
No description has been provided for this image
In [44]:
for col in ['state_id', 'store_id', 'dept_id']:
    Foods3[col] = Foods3[col].astype('category')

group = Foods3.groupby(['year', 'date', 'state_id', 'store_id'], as_index=False)['unit_sold'].sum()

fig = px.violin(group, x='store_id', color='state_id', y='unit_sold', box=True)
fig.update_xaxes(title_text='Store')
fig.update_yaxes(title_text='Total Foods3 items sold')
fig.update_layout(template='seaborn', title='Distribution of Foods3 Sold by stores', legend_title_text='State')
fig.show()
In [45]:
total_time_series = Foods3.groupby('date', as_index=False)['unit_sold'].sum()

# Create a line plot for total Foods3 sales over time
fig = px.line(total_time_series, x='date', y='unit_sold',
              title='Time Series of Total Foods3 Sales',
              labels={'unit_sold': 'Total Units Sold', 'date': 'Date'})
fig.show()
In [46]:
# Aggregate unit_sold by state 
state_level_time_series = Foods3.groupby(['date', 'state_id'], as_index=False)['unit_sold'].sum()
fig = px.line(state_level_time_series,x='date', y='unit_sold', color='state_id',facet_col='state_id', 
    title='Time Series of Foods3 Sales at State Level',
    labels={'unit_sold': 'Total Units Sold', 'date': 'Date', 'state_id': 'State'},
    height=600,
    facet_col_wrap=1
)

fig.update_layout(
    showlegend=False,  
    xaxis_title="Date",
    yaxis_title="Total Units Sold",
    margin=dict(l=40, r=40, t=40, b=40)
)
fig.show()
In [47]:
# Aggregate unit_sold by store
time_series_data = Foods3.groupby(['date', 'store_id', 'state_id'], as_index=False)['unit_sold'].sum()

fig = px.line(time_series_data, x='date', y='unit_sold', color='store_id', facet_row='state_id',
              title='Time Series of Foods3 Sales per Store',
              labels={'unit_sold': 'Total Units Sold', 'date': 'Date', 'store_id': 'Store', 'state_id': 'State'},
              height=900)
fig.show()
In [48]:
Foods3_CA = Foods3[Foods3['state_id'] == 'CA'].copy()
Foods3_CA.head()
Out[48]:
id item_id dept_id cat_id store_id state_id d unit_sold date wm_yr_wk ... year event_name_1 event_type_1 event_name_2 event_type_2 snap_CA snap_TX snap_WI sell_price revenue
2226 FOODS_3_001_CA_1_evaluation FOODS_3_001 FOODS_3 FOODS CA_1 CA d_1 1 2011-01-29 11101 ... 2011 NaN NaN NaN NaN False False False 2.279297 2.279297
2227 FOODS_3_002_CA_1_evaluation FOODS_3_002 FOODS_3 FOODS CA_1 CA d_1 0 2011-01-29 11101 ... 2011 NaN NaN NaN NaN False False False 0.000000 0.000000
2228 FOODS_3_003_CA_1_evaluation FOODS_3_003 FOODS_3 FOODS CA_1 CA d_1 0 2011-01-29 11101 ... 2011 NaN NaN NaN NaN False False False 0.000000 0.000000
2229 FOODS_3_004_CA_1_evaluation FOODS_3_004 FOODS_3 FOODS CA_1 CA d_1 0 2011-01-29 11101 ... 2011 NaN NaN NaN NaN False False False 0.000000 0.000000
2230 FOODS_3_005_CA_1_evaluation FOODS_3_005 FOODS_3 FOODS CA_1 CA d_1 1 2011-01-29 11101 ... 2011 NaN NaN NaN NaN False False False 1.980469 1.980469

5 rows × 23 columns

In [49]:
columns_to_drop = ["id", "item_id","state_id", "dept_id", "cat_id", "wm_yr_wk", "year","weekday","wday","month","snap_TX","snap_WI"]
Foods3_CA = Foods3_CA.drop(columns=columns_to_drop)
Foods3_CA.head()
Out[49]:
store_id d unit_sold date event_name_1 event_type_1 event_name_2 event_type_2 snap_CA sell_price revenue
2226 CA_1 d_1 1 2011-01-29 NaN NaN NaN NaN False 2.279297 2.279297
2227 CA_1 d_1 0 2011-01-29 NaN NaN NaN NaN False 0.000000 0.000000
2228 CA_1 d_1 0 2011-01-29 NaN NaN NaN NaN False 0.000000 0.000000
2229 CA_1 d_1 0 2011-01-29 NaN NaN NaN NaN False 0.000000 0.000000
2230 CA_1 d_1 1 2011-01-29 NaN NaN NaN NaN False 1.980469 1.980469
In [50]:
Foods3_CA["snap_CA"] = Foods3_CA["snap_CA"].astype(int)
Foods3_CA.head()
Out[50]:
store_id d unit_sold date event_name_1 event_type_1 event_name_2 event_type_2 snap_CA sell_price revenue
2226 CA_1 d_1 1 2011-01-29 NaN NaN NaN NaN 0 2.279297 2.279297
2227 CA_1 d_1 0 2011-01-29 NaN NaN NaN NaN 0 0.000000 0.000000
2228 CA_1 d_1 0 2011-01-29 NaN NaN NaN NaN 0 0.000000 0.000000
2229 CA_1 d_1 0 2011-01-29 NaN NaN NaN NaN 0 0.000000 0.000000
2230 CA_1 d_1 1 2011-01-29 NaN NaN NaN NaN 0 1.980469 1.980469
In [51]:
Foods3_CA_agg = Foods3_CA.groupby(["d","store_id"], as_index=False).agg(
    {
        "date": "first",
        "d":"first",
        "store_id":"first",
        "unit_sold": "sum",  # Summing total units sold
        "sell_price": "mean",  # Averaging sell price
        "revenue": "sum",  # Summing total revenue
        "event_name_1": "first",
        "event_type_1": "first",
        "event_name_2": "first",
        "event_type_2": "first",
        "snap_CA": "max",  
    }
)
Foods3_CA_agg = Foods3_CA_agg.dropna(subset=["d", "store_id"])
In [52]:
Foods3_CA_agg.head()
Out[52]:
date d store_id unit_sold sell_price revenue event_name_1 event_type_1 event_name_2 event_type_2 snap_CA
0 2011-01-29 d_1 CA_1 2268 1.161917 4323.710449 None None None None 0.0
1 2011-01-29 d_1 CA_2 1575 1.027044 3093.028076 None None None None 0.0
2 2011-01-29 d_1 CA_3 2478 1.136871 4282.417969 None None None None 0.0
3 2011-01-29 d_1 CA_4 759 1.047593 1566.507812 None None None None 0.0
10 2011-02-07 d_10 CA_1 1711 1.220468 3310.140137 None None None None 1.0
In [53]:
Foods3_CA_agg.info
Out[53]:
<bound method DataFrame.info of             date      d store_id  unit_sold  sell_price      revenue  \
0     2011-01-29    d_1     CA_1       2268    1.161917  4323.710449   
1     2011-01-29    d_1     CA_2       1575    1.027044  3093.028076   
2     2011-01-29    d_1     CA_3       2478    1.136871  4282.417969   
3     2011-01-29    d_1     CA_4        759    1.047593  1566.507812   
10    2011-02-07   d_10     CA_1       1711    1.220468  3310.140137   
...          ...    ...      ...        ...         ...          ...   
19393 2013-10-22  d_998     CA_4       1093    2.349193  2560.722412   
19400 2013-10-23  d_999     CA_1       1822    2.393991  4388.458008   
19401 2013-10-23  d_999     CA_2       1018    2.250309  2336.334473   
19402 2013-10-23  d_999     CA_3       2643    2.363897  5701.542969   
19403 2013-10-23  d_999     CA_4        937    2.349193  2224.834473   

      event_name_1 event_type_1 event_name_2 event_type_2  snap_CA  
0             None         None         None         None      0.0  
1             None         None         None         None      0.0  
2             None         None         None         None      0.0  
3             None         None         None         None      0.0  
10            None         None         None         None      1.0  
...            ...          ...          ...          ...      ...  
19393         None         None         None         None      0.0  
19400         None         None         None         None      0.0  
19401         None         None         None         None      0.0  
19402         None         None         None         None      0.0  
19403         None         None         None         None      0.0  

[7764 rows x 11 columns]>
In [54]:
# Filter rows where event_name_2 is not None and include event_name_1 on the same date
event1_event2_dates = Foods3_CA_agg.dropna(subset=["event_name_2"])[["date", "event_name_1", "event_name_2"]]
unique_event1_event2_dates = event1_event2_dates.drop_duplicates()

print("Unique Dates with Event Name 1 and Event Name 2:")
print(unique_event1_event2_dates)  
Unique Dates with Event Name 1 and Event Name 2:
            date    event_name_1    event_name_2
1990  2014-04-20          Easter  OrthodoxEaster
2620  2014-06-15    NBAFinalsEnd    Father's day
17510 2013-05-05  OrthodoxEaster   Cinco De Mayo
17860 2011-04-24  OrthodoxEaster          Easter
In [55]:
event_dummies1 = pd.get_dummies(Foods3_CA_agg["event_name_1"], prefix="event1")
Foods3_CA_corr = pd.concat([Foods3_CA_agg, event_dummies1], axis=1)
numeric_cols = ["unit_sold"] + list(event_dummies1.columns)  # Only unit_sold and event dummies
correlation_matrix = Foods3_CA_corr[numeric_cols].corr()
event_correlations = correlation_matrix["unit_sold"].drop("unit_sold").sort_values(ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(x=event_correlations.values, y=event_correlations.index, palette="coolwarm")
plt.xlabel("Correlation with Unit Sold")
plt.ylabel("Event Name1")
plt.title("Correlation Between Events and Units Sold")
plt.show()
No description has been provided for this image
In [56]:
top_5_absolute_correlated_events_unit = event_correlations.abs().sort_values(ascending=False).head(5).reset_index()
top_5_absolute_correlated_events_unit.columns = ["event_name_1", "absolute_correlation_with_unit_sold"]
print(top_5_absolute_correlated_events_unit)
          event_name_1  absolute_correlation_with_unit_sold
0     event1_Christmas                             0.104071
1      event1_LaborDay                             0.033212
2  event1_Thanksgiving                             0.031481
3       event1_NewYear                             0.029564
4     event1_SuperBowl                             0.028783
In [57]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Foods3_CA_corr = pd.concat([Foods3_CA_agg, event_dummies1], axis=1)
numeric_cols = ["revenue"] + list(event_dummies1.columns)  # Only revenue and event dummies
correlation_matrix = Foods3_CA_corr[numeric_cols].corr()
event_correlations_revenue = correlation_matrix["revenue"].drop("revenue").sort_values(ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(x=event_correlations_revenue.values, y=event_correlations_revenue.index, palette="coolwarm")
plt.xlabel("Correlation with Revenue")
plt.ylabel("Event Name")
plt.title("Correlation Between Events and Revenue")
plt.show()
No description has been provided for this image
In [58]:
top_5_absolute_correlated_events_revenue = event_correlations_revenue.abs().sort_values(ascending=False).head(5).reset_index()
top_5_absolute_correlated_events_revenue.columns = ["event_name_1", "absolute_correlation_revenue"]
print(top_5_absolute_correlated_events_revenue)
          event_name_1  absolute_correlation_revenue
0     event1_Christmas                      0.107807
1  event1_Thanksgiving                      0.036987
2       event1_NewYear                      0.030744
3     event1_SuperBowl                      0.030330
4      event1_LaborDay                      0.030169
In [59]:
top_events = ["Christmas", "Thanksgiving", "NewYear", "SuperBowl", "LaborDay"]

event_dummies = pd.get_dummies(Foods3_CA_agg["event_name_1"], prefix="event1", dtype=int)

event_dummies = event_dummies[[f"event1_{event}" for event in top_events if f"event1_{event}" in event_dummies.columns]]

Foods3_CA_dummies = pd.concat([Foods3_CA_agg, event_dummies], axis=1)
In [60]:
Foods3_CA_dummies.head()
Out[60]:
date d store_id unit_sold sell_price revenue event_name_1 event_type_1 event_name_2 event_type_2 snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
0 2011-01-29 d_1 CA_1 2268 1.161917 4323.710449 None None None None 0.0 0 0 0 0 0
1 2011-01-29 d_1 CA_2 1575 1.027044 3093.028076 None None None None 0.0 0 0 0 0 0
2 2011-01-29 d_1 CA_3 2478 1.136871 4282.417969 None None None None 0.0 0 0 0 0 0
3 2011-01-29 d_1 CA_4 759 1.047593 1566.507812 None None None None 0.0 0 0 0 0 0
10 2011-02-07 d_10 CA_1 1711 1.220468 3310.140137 None None None None 1.0 0 0 0 0 0
In [61]:
Foods3_CA_df= Foods3_CA_dummies.drop(columns=["d","event_name_1", "event_name_2", "event_type_1", "event_type_2", "event_category"], errors="ignore")
In [62]:
Foods3_CA_df.head()
Out[62]:
date store_id unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
0 2011-01-29 CA_1 2268 1.161917 4323.710449 0.0 0 0 0 0 0
1 2011-01-29 CA_2 1575 1.027044 3093.028076 0.0 0 0 0 0 0
2 2011-01-29 CA_3 2478 1.136871 4282.417969 0.0 0 0 0 0 0
3 2011-01-29 CA_4 759 1.047593 1566.507812 0.0 0 0 0 0 0
10 2011-02-07 CA_1 1711 1.220468 3310.140137 1.0 0 0 0 0 0
In [63]:
Foods3_CA_df.info
Out[63]:
<bound method DataFrame.info of             date store_id  unit_sold  sell_price      revenue  snap_CA  \
0     2011-01-29     CA_1       2268    1.161917  4323.710449      0.0   
1     2011-01-29     CA_2       1575    1.027044  3093.028076      0.0   
2     2011-01-29     CA_3       2478    1.136871  4282.417969      0.0   
3     2011-01-29     CA_4        759    1.047593  1566.507812      0.0   
10    2011-02-07     CA_1       1711    1.220468  3310.140137      1.0   
...          ...      ...        ...         ...          ...      ...   
19393 2013-10-22     CA_4       1093    2.349193  2560.722412      0.0   
19400 2013-10-23     CA_1       1822    2.393991  4388.458008      0.0   
19401 2013-10-23     CA_2       1018    2.250309  2336.334473      0.0   
19402 2013-10-23     CA_3       2643    2.363897  5701.542969      0.0   
19403 2013-10-23     CA_4        937    2.349193  2224.834473      0.0   

       event1_Christmas  event1_Thanksgiving  event1_NewYear  \
0                     0                    0               0   
1                     0                    0               0   
2                     0                    0               0   
3                     0                    0               0   
10                    0                    0               0   
...                 ...                  ...             ...   
19393                 0                    0               0   
19400                 0                    0               0   
19401                 0                    0               0   
19402                 0                    0               0   
19403                 0                    0               0   

       event1_SuperBowl  event1_LaborDay  
0                     0                0  
1                     0                0  
2                     0                0  
3                     0                0  
10                    0                0  
...                 ...              ...  
19393                 0                0  
19400                 0                0  
19401                 0                0  
19402                 0                0  
19403                 0                0  

[7764 rows x 11 columns]>
In [64]:
Foods3_CA_df = Foods3_CA_df.sort_values(by="date")
In [75]:
Foods3_CA_df
Out[75]:
date store_id unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
0 2011-01-29 CA_1 2268 1.161917 4323.710449 0.0 0 0 0 0 0
1 2011-01-29 CA_2 1575 1.027044 3093.028076 0.0 0 0 0 0 0
2 2011-01-29 CA_3 2478 1.136871 4282.417969 0.0 0 0 0 0 0
3 2011-01-29 CA_4 759 1.047593 1566.507812 0.0 0 0 0 0 0
10533 2011-01-30 CA_4 798 1.047593 1600.170898 0.0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
10462 2016-05-21 CA_3 3102 2.919633 7805.052246 0.0 0 0 0 0 0
10471 2016-05-22 CA_2 2628 2.921383 6800.089844 0.0 0 0 0 0 0
10472 2016-05-22 CA_3 3573 2.919633 8930.868164 0.0 0 0 0 0 0
10473 2016-05-22 CA_4 1566 2.927146 4067.809082 0.0 0 0 0 0 0
10470 2016-05-22 CA_1 3274 2.926042 8687.856445 0.0 0 0 0 0 0

7764 rows × 11 columns

In [74]:
Foods3_CA_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7764 entries, 0 to 10470
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   date                 7764 non-null   datetime64[ns]
 1   store_id             7764 non-null   category      
 2   unit_sold            7764 non-null   int16         
 3   sell_price           7764 non-null   float32       
 4   revenue              7764 non-null   float32       
 5   snap_CA              7764 non-null   float64       
 6   event1_Christmas     7764 non-null   int32         
 7   event1_Thanksgiving  7764 non-null   int32         
 8   event1_NewYear       7764 non-null   int32         
 9   event1_SuperBowl     7764 non-null   int32         
 10  event1_LaborDay      7764 non-null   int32         
dtypes: category(1), datetime64[ns](1), float32(2), float64(1), int16(1), int32(5)
memory usage: 417.4 KB
In [80]:
# Foods3_CA_df.to_csv("C:/Users/97610/Downloads/m5-forecasting-accuracy/time series data_2011-2016_4 stores revenue in Foods3 department with 5 major events.csv")
In [482]:
Foods3_CA_df = pd.read_csv('C:/Users/97610/Downloads/m5-forecasting-accuracy/time series data_2011-2016_4 stores revenue in Foods3 department with 5 major events.csv')
In [487]:
Foods3_CA_df.drop('Unnamed: 0', axis=1, inplace=True)
In [488]:
Foods3_CA_df.head()
Out[488]:
date store_id unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
0 2011-01-29 CA_1 2268 1.161917 4323.7104 0.0 0 0 0 0 0
1 2011-01-29 CA_2 1575 1.027044 3093.0280 0.0 0 0 0 0 0
2 2011-01-29 CA_3 2478 1.136871 4282.4180 0.0 0 0 0 0 0
3 2011-01-29 CA_4 759 1.047593 1566.5078 0.0 0 0 0 0 0
4 2011-01-30 CA_4 798 1.047593 1600.1709 0.0 0 0 0 0 0
In [489]:
Foods3_CA_df.tail()
Out[489]:
date store_id unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
7759 2016-05-21 CA_3 3102 2.919633 7805.0522 0.0 0 0 0 0 0
7760 2016-05-22 CA_2 2628 2.921383 6800.0900 0.0 0 0 0 0 0
7761 2016-05-22 CA_3 3573 2.919633 8930.8680 0.0 0 0 0 0 0
7762 2016-05-22 CA_4 1566 2.927146 4067.8090 0.0 0 0 0 0 0
7763 2016-05-22 CA_1 3274 2.926042 8687.8560 0.0 0 0 0 0 0

Since 2011 and 2016 have incomplete month data, we only use 2012 to 2015 for our project analysis¶

In [490]:
Foods3_CA_df["date"] = pd.to_datetime(Foods3_CA_df["date"], format="%Y-%m-%d")

Foods3_CA_df = Foods3_CA_df[
    (Foods3_CA_df["date"] >= pd.Timestamp("2012-01-01")) & 
    (Foods3_CA_df["date"] <= pd.Timestamp("2015-12-31"))
]
In [491]:
Foods3_CA_df.head()
Out[491]:
date store_id unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
1348 2012-01-01 CA_1 1448 1.645839 3213.3958 1.0 0 0 1 0 0
1349 2012-01-01 CA_3 2051 1.630491 4246.9053 1.0 0 0 1 0 0
1350 2012-01-01 CA_4 738 1.589748 1637.3416 1.0 0 0 1 0 0
1351 2012-01-01 CA_2 941 1.473136 1994.1145 1.0 0 0 1 0 0
1352 2012-01-02 CA_1 2001 1.645839 4684.9550 1.0 0 0 0 0 0
In [492]:
Foods3_CA_df.tail()
Out[492]:
date store_id unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
7187 2015-12-30 CA_2 1625 2.923421 3926.1920 0.0 0 0 0 0 0
7188 2015-12-31 CA_4 1072 2.924935 2814.8704 0.0 0 0 0 0 0
7189 2015-12-31 CA_1 1940 2.921556 5333.0360 0.0 0 0 0 0 0
7190 2015-12-31 CA_2 2087 2.923421 5141.0130 0.0 0 0 0 0 0
7191 2015-12-31 CA_3 2682 2.875478 6914.9814 0.0 0 0 0 0 0
In [498]:
Foods3_CA_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5844 entries, 1348 to 7191
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   date                 5844 non-null   datetime64[ns]
 1   store_id             5844 non-null   object        
 2   unit_sold            5844 non-null   int64         
 3   sell_price           5844 non-null   float64       
 4   revenue              5844 non-null   float64       
 5   snap_CA              5844 non-null   float64       
 6   event1_Christmas     5844 non-null   int64         
 7   event1_Thanksgiving  5844 non-null   int64         
 8   event1_NewYear       5844 non-null   int64         
 9   event1_SuperBowl     5844 non-null   int64         
 10  event1_LaborDay      5844 non-null   int64         
dtypes: datetime64[ns](1), float64(3), int64(6), object(1)
memory usage: 547.9+ KB
In [493]:
fig = px.line(
    Foods3_CA_df,
    x="date",
    y="revenue",
    color="store_id", 
    title="Time Series of Revenue Over Time",
    labels={"revenue": "Total Revenue", "date": "Date", "store_id": "Store ID"},
    markers=True 
)

fig.show()
In [494]:
fig = px.box(
    Foods3_CA_df, 
    x="store_id", 
    y="revenue", 
    title="Revenue Distribution by Store",
    labels={"revenue": "Total Revenue", "store_id": "Store ID"},
    notched=True  
)

fig.show()
In [506]:
# Remove unused columns
# Foods3_CA_df['store_id'] = Foods3_CA_df['store_id'].cat.remove_unused_categories()

# Calculate store revenue percentage
# 1. Sum revenue by store
store_revenue = Foods3_CA_df.groupby('store_id')['revenue'].sum().reset_index()

# 2. Calculate the overall total revenue
total_revenue = store_revenue['revenue'].sum()

# 3. Compute each store's revenue percentage
store_revenue['revenue_percentage'] = (store_revenue['revenue'] / total_revenue) * 100

store_revenue.style.format({'revenue': '{:,.2f}'})
Out[506]:
  store_id revenue revenue_percentage
0 CA_1 7,350,989.74 29.276478
1 CA_2 4,219,153.53 16.803445
2 CA_3 9,835,288.64 39.170590
3 CA_4 3,703,428.06 14.749487
In [507]:
fig = px.bar(
    store_revenue,
    x='store_id',
    y='revenue_percentage',
    title='Revenue Percentage by Store',
    labels={'store_id': 'Store ID', 'percentage': 'Revenue Percentage'},
    text='revenue_percentage',  # Display the percentage above each bar
    height=700,  # Increase the height
)

# Optionally format the text to show fewer decimals
fig.update_traces(texttemplate='%{text:.2f}%', textposition='outside')

fig.show()
In [508]:
# Aggregate total revenue by date
total_revenue_by_date = Foods3_CA_df.groupby("date", as_index=False)["revenue"].sum()

fig = px.line(
    total_revenue_by_date,
    x="date",
    y="revenue",
    title="Time Series of Total Revenue Over Time",
    labels={"revenue": "Total Revenue", "date": "Date"},
    markers=True  
)

fig.show()
In [509]:
total_revenue_by_date
Out[509]:
date revenue
0 2012-01-01 11091.7572
1 2012-01-02 16695.3801
2 2012-01-03 13604.3844
3 2012-01-04 12437.8751
4 2012-01-05 14248.9240
... ... ...
1456 2015-12-27 16427.9190
1457 2015-12-28 15311.9904
1458 2015-12-29 14675.1821
1459 2015-12-30 16201.4307
1460 2015-12-31 20203.9008

1461 rows × 2 columns

In [511]:
# total_revenue_by_date.to_csv("C:/Users/97610/Downloads/m5-forecasting-accuracy/time series data_2012-2015_Foods3 department revenue.csv")
In [512]:
Foods3_CA_df
Out[512]:
date store_id unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
1348 2012-01-01 CA_1 1448 1.645839 3213.3958 1.0 0 0 1 0 0
1349 2012-01-01 CA_3 2051 1.630491 4246.9053 1.0 0 0 1 0 0
1350 2012-01-01 CA_4 738 1.589748 1637.3416 1.0 0 0 1 0 0
1351 2012-01-01 CA_2 941 1.473136 1994.1145 1.0 0 0 1 0 0
1352 2012-01-02 CA_1 2001 1.645839 4684.9550 1.0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
7187 2015-12-30 CA_2 1625 2.923421 3926.1920 0.0 0 0 0 0 0
7188 2015-12-31 CA_4 1072 2.924935 2814.8704 0.0 0 0 0 0 0
7189 2015-12-31 CA_1 1940 2.921556 5333.0360 0.0 0 0 0 0 0
7190 2015-12-31 CA_2 2087 2.923421 5141.0130 0.0 0 0 0 0 0
7191 2015-12-31 CA_3 2682 2.875478 6914.9814 0.0 0 0 0 0 0

5844 rows × 11 columns

In [513]:
# Foods3_CA_df.to_csv("C:/Users/97610/Downloads/m5-forecasting-accuracy/time series data_2012-2015_4 stores revenue in Foods3 department with 5 major events.csv")

Revenue Forecasting - Traditional Statistical Model (SARIMAX)¶

In [514]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import pmdarima as pm

# Load visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
%matplotlib inline
from statsmodels.tsa.seasonal import seasonal_decompose

# Load specific forecasting tools
from statsmodels.tsa.arima_model import ARMA,ARMAResults,ARIMA,ARIMAResults
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf # for determining (p,q) orders
from pmdarima import auto_arima # for determining ARIMA orders
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics import mean_absolute_percentage_error
In [11]:
# Foods3_CA_df = pd.read_csv('C:/Users/97610/Downloads/m5-forecasting-accuracy/time series data_2012-2015_4 stores revenue in Foods3 department with 5 major events.csv')
In [515]:
Foods3_CA_df
Out[515]:
date store_id unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
1348 2012-01-01 CA_1 1448 1.645839 3213.3958 1.0 0 0 1 0 0
1349 2012-01-01 CA_3 2051 1.630491 4246.9053 1.0 0 0 1 0 0
1350 2012-01-01 CA_4 738 1.589748 1637.3416 1.0 0 0 1 0 0
1351 2012-01-01 CA_2 941 1.473136 1994.1145 1.0 0 0 1 0 0
1352 2012-01-02 CA_1 2001 1.645839 4684.9550 1.0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
7187 2015-12-30 CA_2 1625 2.923421 3926.1920 0.0 0 0 0 0 0
7188 2015-12-31 CA_4 1072 2.924935 2814.8704 0.0 0 0 0 0 0
7189 2015-12-31 CA_1 1940 2.921556 5333.0360 0.0 0 0 0 0 0
7190 2015-12-31 CA_2 2087 2.923421 5141.0130 0.0 0 0 0 0 0
7191 2015-12-31 CA_3 2682 2.875478 6914.9814 0.0 0 0 0 0 0

5844 rows × 11 columns

In [516]:
Foods3_CA_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5844 entries, 1348 to 7191
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   date                 5844 non-null   datetime64[ns]
 1   store_id             5844 non-null   object        
 2   unit_sold            5844 non-null   int64         
 3   sell_price           5844 non-null   float64       
 4   revenue              5844 non-null   float64       
 5   snap_CA              5844 non-null   float64       
 6   event1_Christmas     5844 non-null   int64         
 7   event1_Thanksgiving  5844 non-null   int64         
 8   event1_NewYear       5844 non-null   int64         
 9   event1_SuperBowl     5844 non-null   int64         
 10  event1_LaborDay      5844 non-null   int64         
dtypes: datetime64[ns](1), float64(3), int64(6), object(1)
memory usage: 547.9+ KB

Data Preprocessing¶

In [518]:
# Convert 'date' to datetime format
Foods3_CA_df['date'] = pd.to_datetime(Foods3_CA_df['date'])

# Drop unnecessary columns (like the unnamed index column)
# Foods3_CA_df.drop(columns=['Unnamed: 0'], inplace=True)

# Check for missing values
missing_values = Foods3_CA_df.isnull().sum()

# Aggregate data if needed (checking for duplicate entries per store per date)
duplicate_check = Foods3_CA_df.duplicated(subset=['date', 'store_id']).sum()

# Display updated dataframe info
Foods3_CA_df.info(), missing_values, duplicate_check
<class 'pandas.core.frame.DataFrame'>
Index: 5844 entries, 1348 to 7191
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   date                 5844 non-null   datetime64[ns]
 1   store_id             5844 non-null   object        
 2   unit_sold            5844 non-null   int64         
 3   sell_price           5844 non-null   float64       
 4   revenue              5844 non-null   float64       
 5   snap_CA              5844 non-null   float64       
 6   event1_Christmas     5844 non-null   int64         
 7   event1_Thanksgiving  5844 non-null   int64         
 8   event1_NewYear       5844 non-null   int64         
 9   event1_SuperBowl     5844 non-null   int64         
 10  event1_LaborDay      5844 non-null   int64         
dtypes: datetime64[ns](1), float64(3), int64(6), object(1)
memory usage: 547.9+ KB
Out[518]:
(None,
 date                   0
 store_id               0
 unit_sold              0
 sell_price             0
 revenue                0
 snap_CA                0
 event1_Christmas       0
 event1_Thanksgiving    0
 event1_NewYear         0
 event1_SuperBowl       0
 event1_LaborDay        0
 dtype: int64,
 0)

1. Foods3 Department Revenue Forecasting¶

Feature Engineering & EDA¶

In [544]:
# Aggregate data: Summing revenue per date across all stores
department_daily = Foods3_CA_df.groupby('date').agg({
    'revenue': 'sum',  # Total revenue per day
    'unit_sold': 'sum',  # Total units sold per day
    'sell_price': 'mean',  # Average selling price
    'snap_CA': 'mean',  # Average SNAP benefit indicator
    'event1_Christmas': 'max',  # If any store had the event on a given day
    'event1_Thanksgiving': 'max',
    'event1_NewYear': 'max',
    'event1_SuperBowl': 'max',
    'event1_LaborDay': 'max'
}).reset_index()
In [545]:
# Display daily aggregated data
department_daily
Out[545]:
date revenue unit_sold sell_price snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
0 2012-01-01 11091.7572 5178 1.584803 1.0 0 0 1 0 0
1 2012-01-02 16695.3801 7517 1.584803 1.0 0 0 0 0 0
2 2012-01-03 13604.3844 6368 1.584803 1.0 0 0 0 0 0
3 2012-01-04 12437.8751 5748 1.584803 1.0 0 0 0 0 0
4 2012-01-05 14248.9240 6501 1.584803 1.0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ...
1456 2015-12-27 16427.9190 6212 2.911347 0.0 0 0 0 0 0
1457 2015-12-28 15311.9904 5924 2.911347 0.0 0 0 0 0 0
1458 2015-12-29 14675.1821 5680 2.911347 0.0 0 0 0 0 0
1459 2015-12-30 16201.4307 6189 2.911347 0.0 0 0 0 0 0
1460 2015-12-31 20203.9008 7781 2.911347 0.0 0 0 0 0 0

1461 rows × 10 columns

In [546]:
# Plot daily revenue trend
plt.figure(figsize=(12,6))
plt.plot(department_daily['date'], department_daily['revenue'], label='Daily Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('Foods 3 Department Daily Revenue Trend Over Time')
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image
In [547]:
# Aggregate data: Create column "week" and summing revenue per week across all stores
Foods3_CA_df['week'] = Foods3_CA_df['date'].dt.to_period('W')
department_weekly = Foods3_CA_df.groupby('week').agg({
    'revenue': 'sum',
    'unit_sold': 'sum',
    'sell_price': 'mean',
    'snap_CA': 'mean',
    'event1_Christmas': 'max',
    'event1_Thanksgiving': 'max',
    'event1_NewYear': 'max',
    'event1_SuperBowl': 'max',
    'event1_LaborDay': 'max'
}).reset_index()

department_weekly = department_weekly.iloc[1:-1].reset_index(drop=True)
In [548]:
# Display weekly aggregated data
department_weekly
Out[548]:
week revenue unit_sold sell_price snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
0 2012-01-02/2012-01-08 110698.152200 50362 1.588490 1.000000 0 0 0 0 0
1 2012-01-09/2012-01-15 102482.814800 47556 1.598924 0.285714 0 0 0 0 0
2 2012-01-16/2012-01-22 102863.599500 47503 1.604877 0.000000 0 0 0 0 0
3 2012-01-23/2012-01-29 93079.113900 43992 1.624494 0.000000 0 0 0 0 0
4 2012-01-30/2012-02-05 114189.451100 51928 1.657096 0.714286 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ...
203 2015-11-23/2015-11-29 111590.456300 45080 2.913383 0.000000 0 1 0 0 0
204 2015-11-30/2015-12-06 119007.117600 46183 2.912602 0.857143 0 0 0 0 0
205 2015-12-07/2015-12-13 115926.823100 45490 2.910630 0.571429 0 0 0 0 0
206 2015-12-14/2015-12-20 115783.140800 44612 2.910472 0.000000 0 0 0 0 0
207 2015-12-21/2015-12-27 99922.598308 38469 2.910264 0.000000 1 0 0 0 0

208 rows × 10 columns

In [561]:
# Convert the 'week' column from Period to Timestamp
# department_weekly['week'] = department_weekly['week'].dt.to_timestamp()

# Plot weekly revenue trend
plt.figure(figsize=(12,6))
plt.plot(department_weekly['week'], department_weekly['revenue'], label='Weekly Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('Foods 3 Department Weekly Revenue Trend Over Time')
plt.legend()
plt.show()
No description has been provided for this image
In [550]:
# Aggregate data: Create column "month" and summing revenue per month across all stores
Foods3_CA_df['month'] = Foods3_CA_df['date'].dt.to_period('M')
department_monthly = Foods3_CA_df.groupby('month').agg({
    'revenue': 'sum',
    'unit_sold': 'sum',
    'sell_price': 'mean',
    'snap_CA': 'mean',
    'event1_Christmas': 'max',
    'event1_Thanksgiving': 'max',
    'event1_NewYear': 'max',
    'event1_SuperBowl': 'max',
    'event1_LaborDay': 'max'
}).reset_index()
In [551]:
# Display monthly aggregated data
department_monthly
Out[551]:
month revenue unit_sold sell_price snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
0 2012-01 446652.993300 206796 1.606872 0.322581 0 0 1 0 0
1 2012-02 457478.040800 205482 1.664356 0.344828 0 0 0 1 0
2 2012-03 527256.333700 234350 1.728166 0.322581 0 0 0 0 0
3 2012-04 515853.648100 232114 1.766734 0.333333 0 0 0 0 0
4 2012-05 553696.991600 252253 1.798810 0.322581 0 0 0 0 0
5 2012-06 550051.849800 255203 1.821972 0.333333 0 0 0 0 0
6 2012-07 533941.888900 247467 1.832073 0.322581 0 0 0 0 0
7 2012-08 565252.727200 257153 1.841301 0.322581 0 0 0 0 0
8 2012-09 514575.798200 233260 1.853635 0.333333 0 0 0 0 1
9 2012-10 463063.788800 216610 1.886128 0.322581 0 0 0 0 0
10 2012-11 409666.099600 193544 1.908485 0.333333 0 1 0 0 0
11 2012-12 449432.537416 208390 1.939479 0.322581 1 0 0 0 0
12 2013-01 451909.413200 204266 1.977420 0.322581 0 0 1 0 0
13 2013-02 416719.174500 189610 2.021923 0.357143 0 0 0 1 0
14 2013-03 469418.195600 218046 2.063936 0.322581 0 0 0 0 0
15 2013-04 469187.201400 218086 2.096399 0.333333 0 0 0 0 0
16 2013-05 483007.078600 226260 2.132688 0.322581 0 0 0 0 0
17 2013-06 533400.292800 251893 2.164049 0.333333 0 0 0 0 0
18 2013-07 552472.551900 260784 2.181937 0.322581 0 0 0 0 0
19 2013-08 565352.365700 274857 2.210481 0.322581 0 0 0 0 0
20 2013-09 557284.346200 271261 2.276460 0.333333 0 0 0 0 1
21 2013-10 565761.394900 251945 2.335426 0.322581 0 0 0 0 0
22 2013-11 495937.825500 227489 2.343453 0.333333 0 1 0 0 0
23 2013-12 484834.025153 223967 2.351611 0.322581 1 0 0 0 0
24 2014-01 502707.159500 231600 2.399634 0.322581 0 0 1 0 0
25 2014-02 469008.767100 204153 2.450571 0.357143 0 0 0 1 0
26 2014-03 567044.516300 243631 2.500773 0.322581 0 0 0 0 0
27 2014-04 582852.901900 252596 2.531435 0.333333 0 0 0 0 0
28 2014-05 582902.376400 248245 2.554456 0.322581 0 0 0 0 0
29 2014-06 560458.577300 253103 2.592864 0.333333 0 0 0 0 0
30 2014-07 587890.839800 264626 2.627328 0.322581 0 0 0 0 0
31 2014-08 587159.544800 260661 2.650960 0.322581 0 0 0 0 0
32 2014-09 552359.528300 242877 2.689410 0.333333 0 0 0 0 1
33 2014-10 571807.364100 230924 2.718435 0.322581 0 0 0 0 0
34 2014-11 540479.641000 219940 2.747414 0.333333 0 1 0 0 0
35 2014-12 491734.536911 193572 2.756147 0.322581 1 0 0 0 0
36 2015-01 525618.729000 215120 2.792603 0.322581 0 0 1 0 0
37 2015-02 469823.563900 189967 2.831282 0.357143 0 0 0 1 0
38 2015-03 536504.483700 216395 2.858263 0.322581 0 0 0 0 0
39 2015-04 515726.202600 204760 2.870039 0.333333 0 0 0 0 0
40 2015-05 533872.967000 212946 2.879448 0.322581 0 0 0 0 0
41 2015-06 535634.996200 232264 2.889860 0.333333 0 0 0 0 0
42 2015-07 578047.229000 245723 2.890509 0.322581 0 0 0 0 0
43 2015-08 592465.328800 250992 2.895919 0.322581 0 0 0 0 0
44 2015-09 568701.797800 239233 2.904005 0.333333 0 0 0 0 1
45 2015-10 603689.184300 245515 2.913604 0.322581 0 0 0 0 0
46 2015-11 518822.576500 206328 2.913687 0.333333 0 1 0 0 0
47 2015-12 501340.598508 194361 2.910960 0.322581 1 0 0 0 0
In [563]:
# Convert the 'month' column from Period to Timestamp
department_monthly['month'] = department_monthly['month'].dt.to_timestamp()

plt.figure(figsize=(12,6))
plt.plot(department_monthly['month'], department_monthly['revenue'], label='Monthly Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('Foods 3 Department Monthly Revenue Trend Over Time')
plt.legend()
plt.show()
No description has been provided for this image
In [553]:
# Aggregate data: Create column "year" and summing revenue per year across all stores
Foods3_CA_df['year'] = Foods3_CA_df['date'].dt.to_period('Y')
department_yearly = Foods3_CA_df.groupby('year').agg({
    'revenue': 'sum',
    'unit_sold': 'sum',
    'sell_price': 'mean',
    'snap_CA': 'mean',
    'event1_Christmas': 'max',
    'event1_Thanksgiving': 'max',
    'event1_NewYear': 'max',
    'event1_SuperBowl': 'max',
    'event1_LaborDay': 'max'
}).reset_index()
In [554]:
# Display yearly aggregated data
department_yearly
Out[554]:
year revenue unit_sold sell_price snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
0 2012 5.986923e+06 2742622 1.804395 0.327869 1 1 1 1 1
1 2013 6.045284e+06 2818464 2.180502 0.328767 1 1 1 1 1
2 2014 6.596406e+06 2845928 2.602437 0.328767 1 1 1 1 1
3 2015 6.480248e+06 2653604 2.879409 0.328767 1 1 1 1 1
In [564]:
# Convert the 'year' column from Period to Timestamp
department_yearly['year'] = department_yearly['year'].dt.to_timestamp()

plt.figure(figsize=(12,6))
plt.plot(department_yearly['year'], department_yearly['revenue'], label='Yearly Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('Foods 3 Department Yearly Revenue Trend Over Time')
plt.legend()
plt.show()
No description has been provided for this image
In [565]:
Foods3_CA_df
Out[565]:
date store_id unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay week month year
1348 2012-01-01 CA_1 1448 1.645839 3213.3958 1.0 0 0 1 0 0 2011-12-26/2012-01-01 2012-01 2012
1349 2012-01-01 CA_3 2051 1.630491 4246.9053 1.0 0 0 1 0 0 2011-12-26/2012-01-01 2012-01 2012
1350 2012-01-01 CA_4 738 1.589748 1637.3416 1.0 0 0 1 0 0 2011-12-26/2012-01-01 2012-01 2012
1351 2012-01-01 CA_2 941 1.473136 1994.1145 1.0 0 0 1 0 0 2011-12-26/2012-01-01 2012-01 2012
1352 2012-01-02 CA_1 2001 1.645839 4684.9550 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7187 2015-12-30 CA_2 1625 2.923421 3926.1920 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7188 2015-12-31 CA_4 1072 2.924935 2814.8704 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7189 2015-12-31 CA_1 1940 2.921556 5333.0360 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7190 2015-12-31 CA_2 2087 2.923421 5141.0130 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7191 2015-12-31 CA_3 2682 2.875478 6914.9814 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015

5844 rows × 14 columns

Proceed With Department Daily Data (Since Event Happens Daily)¶

In [566]:
department_daily.describe()
Out[566]:
date revenue unit_sold sell_price snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
count 1461 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000
mean 2013-12-31 00:00:00 17186.078011 7570.580424 2.366301 0.328542 0.002738 0.002738 0.002738 0.002738 0.002738
min 2012-01-01 00:00:00 8.299316 5.000000 1.584803 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2012-12-31 00:00:00 14804.731200 6466.000000 1.950130 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 2013-12-31 00:00:00 16695.380100 7332.000000 2.368837 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 2014-12-31 00:00:00 19387.146800 8571.000000 2.763858 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 2015-12-31 00:00:00 27796.757300 12852.000000 2.915013 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
std NaN 3457.469523 1538.943798 0.420759 0.469844 0.052271 0.052271 0.052271 0.052271 0.052271
In [567]:
department_daily.shape
Out[567]:
(1461, 10)
In [569]:
# Plot histograms for each feature (Ignore first plot - date column)
department_daily.iloc[:, 1:].hist(bins=30, figsize=(12, 10))
plt.tight_layout()
plt.show()
No description has been provided for this image
In [570]:
# Select only numeric columns from the DataFrame
numeric_df = department_daily.select_dtypes(include=['number'])

# Correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.show()
No description has been provided for this image

unit_sold and revenue are highly positive correlated, indicating higher units sold directly cause higher revenue

sell_price and revenue also have moderately positive correlation, implying higher price also contribute to higher revenue

sell_price and unit_sold have slightly positive correlation, which is counter-intuitive, since higher price should lead to lower sales

event1_Christmas, event1_Thanksgiving, event1_NewYear are negatively correlated with revenue, meaning people stay at home and buy less on those event days, it could also be store closure on those days (especially for Christmas, which has the most nagetive impact on revenue)

event1_SuperBowl and event1_LaborDay are positively correlated with revenue, meaning people are buying more on those days, but the effect is very slight

snap_CA slightly contributes to revenue and unit_sold; interestingly, people are using less SNAP on Christmas and Thanksgiving, but using more SNAP on New Year (I guess the reason could be that stores close on Christmas and Thanksgiving, but open on New Year)

In [571]:
# Check for seasonality using a rolling mean
department_daily['revenue_rolling_mean'] = department_daily['revenue'].rolling(window=30, min_periods=1).mean()

# Plot rolling mean trend
plt.figure(figsize=(12,6))
plt.plot(department_daily['date'], department_daily['revenue'], label='Daily Revenue', alpha=0.5)
plt.plot(department_daily['date'], department_daily['revenue_rolling_mean'], label='30-Day Rolling Mean', color='red')
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('Foods 3 Department Daily Revenue Trend with 30-Day Rolling Mean')
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image

The plot indicates strong seasonality with predictable peaks around key events and a slightly rising baseline over the years. The 30-day rolling mean is especially helpful for seeing the bigger picture behind the daily volatility¶

In [572]:
# Perform time series decomposition
decomposition = seasonal_decompose(department_daily.set_index('date')['revenue'], model='additive', period=365)

# Plot decomposition components
plt.figure(figsize=(12,8))

plt.subplot(4, 1, 1)
plt.plot(decomposition.observed, label='Observed', color='black')
plt.legend(loc='upper left')

plt.subplot(4, 1, 2)
plt.plot(decomposition.trend, label='Trend', color='blue')
plt.legend(loc='upper left')

plt.subplot(4, 1, 3)
plt.plot(decomposition.seasonal, label='Seasonal', color='green')
plt.legend(loc='upper left')

plt.subplot(4, 1, 4)
plt.plot(decomposition.resid, label='Residual', color='red')
plt.legend(loc='upper left')

plt.tight_layout()
plt.show()
No description has been provided for this image

Both a gradual, long-term trend and recurring seasonal patterns play significant roles in this revenue time series, with the remaining “residual” fluctuations being relatively random day-to-day noise¶

In [573]:
# Extract seasonal component
seasonality_strength = np.var(decomposition.seasonal) / np.var(department_daily['revenue'])

# Display the seasonality strength
seasonality_strength
Out[573]:
0.5453845443618595

Roughly 55% of the variance in department revenue series is explained by seasonality, indicating a strong degree of seasonality¶

In [574]:
# Autocorrelation and Partial Autocorrelation plots
plt.figure(figsize=(12,4))
plot_acf(department_daily['revenue'], lags=50, zero=False, ax=plt.gca())  # Autocorrelation from lag 1
plt.title('Autocorrelation of Revenue (starting from lag 1)')
plt.show()

plt.figure(figsize=(12,4))
plot_pacf(department_daily['revenue'], lags=50, zero=False, ax=plt.gca())  # Partial Autocorrelation from lag 1
plt.title('Partial Autocorrelation of Revenue (starting from lag 1)')
plt.show()
No description has been provided for this image
No description has been provided for this image

PACF is p or AR, pretty obvious that p is 1

ACF is q or MA, not obvious that maybe q is 1 as well

Begin Modeling¶

Apply ADF test to check stationarity¶

In [575]:
# Automate Augmented Dickey-Fuller (ADF) test for stationarity

def adf_test(series, title=''):
    """
    Pass in a time series and an optional title, returns an ADF report.
    """
    print(f'Augmented Dickey-Fuller Test: {title}')
    result = adfuller(series.dropna(), autolag='AIC')  # .dropna() handles differenced data
    
    labels = ['ADF test statistic', 'p-value', '# lags used', '# observations']
    out = pd.Series(result[0:4], index=labels)

    for key, val in result[4].items():
        out[f'critical value ({key})'] = val
        
    print(out.to_string())  # .to_string() removes the line "dtype: float64"
    
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis")
        print("Reject the null hypothesis")
        print("Data has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis")
        print("Fail to reject the null hypothesis")
        print("Data has a unit root and is non-stationary")
In [576]:
adf_test(department_daily['revenue'])
Augmented Dickey-Fuller Test: 
ADF test statistic        -3.314950
p-value                    0.014223
# lags used               24.000000
# observations          1436.000000
critical value (1%)       -3.434912
critical value (5%)       -2.863555
critical value (10%)      -2.567843
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary

Since it is a stationary data, we don't need to difference it to achieve stationary, skip the below differencing steps

In [577]:
# # Apply first-order differencing
# department_daily['revenue_diff'] = department_daily['revenue'].diff()

# # Perform ADF test again on the differenced series
# adf_test(department_daily['revenue_diff'])
Augmented Dickey-Fuller Test: 
ADF test statistic     -1.420797e+01
p-value                 1.741457e-26
# lags used             2.400000e+01
# observations          1.435000e+03
critical value (1%)    -3.434915e+00
critical value (5%)    -2.863556e+00
critical value (10%)   -2.567843e+00
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
In [578]:
# # Plot the differenced revenue
# plt.figure(figsize=(12,6))
# plt.plot(department_daily['date'], department_daily['revenue_diff'], label='Differenced Revenue', color='purple')
# plt.xlabel('Date')
# plt.ylabel('Differenced Revenue')
# plt.title('First-Order Differencing of Revenue to Achieve Stationarity')
# plt.legend()
# plt.grid(True)
# plt.show()
No description has been provided for this image

Since we checked seasonality strength (55%) and data has 5 exogenous variables, we proceed with SARIMAX:¶

In [582]:
department_daily
Out[582]:
date revenue unit_sold sell_price snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay revenue_rolling_mean
0 2012-01-01 11091.7572 5178 1.584803 1.0 0 0 1 0 0 11091.757200
1 2012-01-02 16695.3801 7517 1.584803 1.0 0 0 0 0 0 13893.568650
2 2012-01-03 13604.3844 6368 1.584803 1.0 0 0 0 0 0 13797.173900
3 2012-01-04 12437.8751 5748 1.584803 1.0 0 0 0 0 0 13457.349200
4 2012-01-05 14248.9240 6501 1.584803 1.0 0 0 0 0 0 13615.664160
... ... ... ... ... ... ... ... ... ... ... ...
1456 2015-12-27 16427.9190 6212 2.911347 0.0 0 0 0 0 0 16137.524387
1457 2015-12-28 15311.9904 5924 2.911347 0.0 0 0 0 0 0 16148.145014
1458 2015-12-29 14675.1821 5680 2.911347 0.0 0 0 0 0 0 16020.895077
1459 2015-12-30 16201.4307 6189 2.911347 0.0 0 0 0 0 0 16037.889924
1460 2015-12-31 20203.9008 7781 2.911347 0.0 0 0 0 0 0 16207.451810

1461 rows × 11 columns

In [583]:
department_daily.isnull().sum()
Out[583]:
date                    0
revenue                 0
unit_sold               0
sell_price              0
snap_CA                 0
event1_Christmas        0
event1_Thanksgiving     0
event1_NewYear          0
event1_SuperBowl        0
event1_LaborDay         0
revenue_rolling_mean    0
dtype: int64
In [197]:
# #### When computing a differenced column, the first value naturally becomes NaN because there is no previous value to subtract from. Thus, we fill it with 0
# department_daily["revenue_diff"].fillna(0, inplace=True)
C:\Users\97610\AppData\Local\Temp\ipykernel_8664\3669965985.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  department_daily["revenue_diff"].fillna(0, inplace=True)
In [133]:
# # Define the endogenous (target) variable and exogenous variables
# y = department_daily.set_index('date')['revenue']
# exo = department_daily.set_index('date')[['event1_Christmas', 'event1_Thanksgiving', 'event1_NewYear', 'event1_SuperBowl', 'event1_LaborDay']]

# # Use Auto ARIMA to determine optimal SARIMAX parameters
# auto_arima_model = pm.auto_arima(
#     y, 
#     exogenous=exo, 
#     seasonal=True, 
#     m=365,  # Assuming daily seasonality
#     trace=True, 
#     error_action='ignore', 
#     suppress_warnings=True, 
#     stepwise=True,
#     max_p=2,     # maximum non-seasonal AR terms
#     max_q=3,     # maximum non-seasonal MA terms
#     max_P=2,     # maximum seasonal AR terms
#     max_Q=3      # maximum seasonal MA terms
# )

# # Extract the best parameters
# optimal_order = auto_arima_model.order
# optimal_seasonal_order = auto_arima_model.seasonal_order

# optimal_order, optimal_seasonal_order

Since having daily seasonality will cause auto ARIMA to run forever, we use weekly seasonality instead:¶

In [652]:
# Define the endogenous (target) variable and exogenous variables
y = department_daily.set_index('date')['revenue']
exo = department_daily.set_index('date')[['event1_Christmas', 'event1_Thanksgiving', 'event1_NewYear', 'event1_SuperBowl', 'event1_LaborDay']]

# Use Auto ARIMA to determine optimal SARIMAX parameters
auto_arima_model = pm.auto_arima(
    y, 
    exogenous=exo, 
    stationary=True,  # Tells auto_arima that the series is already stationary
    seasonal=True, 
    m=7,  # Assuming weekly seasonality
    trace=True, 
    error_action='ignore', 
    suppress_warnings=True, 
    stepwise=True
)

# Extract the best parameters
optimal_order = auto_arima_model.order
optimal_seasonal_order = auto_arima_model.seasonal_order

optimal_order, optimal_seasonal_order
Performing stepwise search to minimize aic
 ARIMA(2,0,2)(1,0,1)[7] intercept   : AIC=26357.188, Time=4.05 sec
 ARIMA(0,0,0)(0,0,0)[7] intercept   : AIC=27958.448, Time=0.03 sec
 ARIMA(1,0,0)(1,0,0)[7] intercept   : AIC=26683.636, Time=1.37 sec
 ARIMA(0,0,1)(0,0,1)[7] intercept   : AIC=26872.441, Time=0.76 sec
 ARIMA(0,0,0)(0,0,0)[7]             : AIC=32700.985, Time=0.02 sec
 ARIMA(2,0,2)(0,0,1)[7] intercept   : AIC=26817.098, Time=1.68 sec
 ARIMA(2,0,2)(1,0,0)[7] intercept   : AIC=26473.888, Time=2.71 sec
 ARIMA(2,0,2)(2,0,1)[7] intercept   : AIC=inf, Time=4.48 sec
 ARIMA(2,0,2)(1,0,2)[7] intercept   : AIC=inf, Time=5.44 sec
 ARIMA(2,0,2)(0,0,0)[7] intercept   : AIC=27219.512, Time=0.94 sec
 ARIMA(2,0,2)(0,0,2)[7] intercept   : AIC=26635.132, Time=2.41 sec
 ARIMA(2,0,2)(2,0,0)[7] intercept   : AIC=26366.907, Time=5.39 sec
 ARIMA(2,0,2)(2,0,2)[7] intercept   : AIC=inf, Time=6.02 sec
 ARIMA(1,0,2)(1,0,1)[7] intercept   : AIC=inf, Time=3.33 sec
 ARIMA(2,0,1)(1,0,1)[7] intercept   : AIC=inf, Time=1.51 sec
 ARIMA(3,0,2)(1,0,1)[7] intercept   : AIC=inf, Time=3.66 sec
 ARIMA(2,0,3)(1,0,1)[7] intercept   : AIC=26181.941, Time=3.63 sec
 ARIMA(2,0,3)(0,0,1)[7] intercept   : AIC=26856.728, Time=2.07 sec
 ARIMA(2,0,3)(1,0,0)[7] intercept   : AIC=26373.643, Time=2.34 sec
 ARIMA(2,0,3)(2,0,1)[7] intercept   : AIC=inf, Time=5.54 sec
 ARIMA(2,0,3)(1,0,2)[7] intercept   : AIC=inf, Time=6.28 sec
 ARIMA(2,0,3)(0,0,0)[7] intercept   : AIC=27329.799, Time=1.26 sec
 ARIMA(2,0,3)(0,0,2)[7] intercept   : AIC=26632.885, Time=3.52 sec
 ARIMA(2,0,3)(2,0,0)[7] intercept   : AIC=26232.591, Time=4.99 sec
 ARIMA(2,0,3)(2,0,2)[7] intercept   : AIC=26161.978, Time=6.82 sec
 ARIMA(1,0,3)(2,0,2)[7] intercept   : AIC=inf, Time=6.04 sec
 ARIMA(3,0,3)(2,0,2)[7] intercept   : AIC=inf, Time=9.38 sec
 ARIMA(2,0,4)(2,0,2)[7] intercept   : AIC=inf, Time=9.49 sec
 ARIMA(1,0,2)(2,0,2)[7] intercept   : AIC=inf, Time=6.60 sec
 ARIMA(1,0,4)(2,0,2)[7] intercept   : AIC=inf, Time=8.22 sec
 ARIMA(3,0,2)(2,0,2)[7] intercept   : AIC=inf, Time=6.03 sec
 ARIMA(3,0,4)(2,0,2)[7] intercept   : AIC=inf, Time=10.39 sec
 ARIMA(2,0,3)(2,0,2)[7]             : AIC=inf, Time=6.51 sec

Best model:  ARIMA(2,0,3)(2,0,2)[7] intercept
Total fit time: 142.937 seconds
Out[652]:
((2, 0, 3), (2, 0, 2, 7))
In [653]:
# Applying AUTO_ARIMA model to get p and q automatically
auto_arima_model.summary()
Out[653]:
SARIMAX Results
Dep. Variable: y No. Observations: 1461
Model: SARIMAX(2, 0, 3)x(2, 0, [1, 2], 7) Log Likelihood -13069.989
Date: Wed, 12 Mar 2025 AIC 26161.978
Time: 19:25:48 BIC 26220.133
Sample: 01-01-2012 HQIC 26183.671
- 12-31-2015
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 7842.3152 7732.518 1.014 0.310 -7313.141 2.3e+04
ar.L1 -1.6668 0.052 -32.317 0.000 -1.768 -1.566
ar.L2 -0.9272 0.051 -18.150 0.000 -1.027 -0.827
ma.L1 2.0575 0.058 35.404 0.000 1.944 2.171
ma.L2 1.5678 0.089 17.668 0.000 1.394 1.742
ma.L3 0.3439 0.039 8.807 0.000 0.267 0.420
ar.S.L7 0.2302 1.583 0.145 0.884 -2.873 3.333
ar.S.L14 0.6439 1.462 0.440 0.660 -2.221 3.509
ma.S.L7 0.1127 1.573 0.072 0.943 -2.971 3.196
ma.S.L14 -0.4179 0.910 -0.459 0.646 -2.201 1.365
sigma2 4.262e+06 5.207 8.19e+05 0.000 4.26e+06 4.26e+06
Ljung-Box (L1) (Q): 6.70 Jarque-Bera (JB): 7617.82
Prob(Q): 0.01 Prob(JB): 0.00
Heteroskedasticity (H): 1.22 Skew: -1.00
Prob(H) (two-sided): 0.03 Kurtosis: 14.01


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 8.7e+21. Standard errors may be unstable.
In [654]:
# # Define the orders based on the auto_arima output:
# order = (1, 0, 1)
# seasonal_order = (1, 0, 1, 7)

# # Initialize and fit the SARIMAX model
# model = SARIMAX(weekly_df['num_orders'], order=order, seasonal_order=seasonal_order)
# results = model.fit()
# print(results.summary())
In [655]:
# Fit the SARIMAX model
sarimax_model = SARIMAX(y, 
                        exog=exo, 
                        order=optimal_order, 
                        seasonal_order=optimal_seasonal_order, 
                        enforce_stationarity=False, 
                        enforce_invertibility=False)

sarimax_results = sarimax_model.fit()

# Display model summary
sarimax_results.summary()
Out[655]:
SARIMAX Results
Dep. Variable: revenue No. Observations: 1461
Model: SARIMAX(2, 0, 3)x(2, 0, [1, 2], 7) Log Likelihood -13353.446
Date: Wed, 12 Mar 2025 AIC 26736.892
Time: 19:26:12 BIC 26816.009
Sample: 01-01-2012 HQIC 26766.423
- 12-31-2015
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
event1_Christmas 10.3751 336.898 0.031 0.975 -649.932 670.682
event1_Thanksgiving 1.115e+04 340.184 32.785 0.000 1.05e+04 1.18e+04
event1_NewYear 1.131e+04 573.941 19.702 0.000 1.02e+04 1.24e+04
event1_SuperBowl 2.133e+04 390.929 54.554 0.000 2.06e+04 2.21e+04
event1_LaborDay 2.245e+04 275.143 81.594 0.000 2.19e+04 2.3e+04
ar.L1 1.4789 0.367 4.033 0.000 0.760 2.198
ar.L2 -0.4870 0.359 -1.355 0.175 -1.191 0.217
ma.L1 -1.2754 0.368 -3.466 0.001 -1.997 -0.554
ma.L2 0.3446 0.287 1.200 0.230 -0.218 0.907
ma.L3 -0.0221 0.053 -0.415 0.678 -0.126 0.082
ar.S.L7 0.8671 0.259 3.348 0.001 0.360 1.375
ar.S.L14 0.1335 0.259 0.515 0.606 -0.374 0.641
ma.S.L7 -0.7419 0.263 -2.824 0.005 -1.257 -0.227
ma.S.L14 -0.2315 0.251 -0.923 0.356 -0.723 0.260
sigma2 8.145e+06 0.970 8.4e+06 0.000 8.14e+06 8.14e+06
Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 41086.87
Prob(Q): 0.94 Prob(JB): 0.00
Heteroskedasticity (H): 1.02 Skew: -3.58
Prob(H) (two-sided): 0.80 Kurtosis: 28.14


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 7.9e+20. Standard errors may be unstable.
In [656]:
# Plot diagnostic plots (residuals, etc.)
sarimax_results.plot_diagnostics(figsize=(15, 12))
plt.show()
No description has been provided for this image

Residual Randomness: The standardized residuals seem fairly random and centered around zero

Normality: The histogram and Q-Q plot suggest residuals are not drastically non-normal. We might see slight deviations in the tails (common in real-world data)

No Major Autocorrelation: The correlogram shows no large spikes outside the confidence intervals, indicating the model has likely captured the time dependencies

These diagnostics collectively suggest the model is doing a reasonably good job¶

In [657]:
plt.figure(figsize=(12, 6))
plt.plot(y.index, y, label='Actual', marker='o')
plt.plot(y.index, sarimax_results.fittedvalues, label='Fitted', marker='x', color='red')
plt.title('Actual vs Fitted Values')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.legend()
plt.show()
No description has been provided for this image
In [658]:
# Actual values (endogenous variable)
y_true = y  # or test set if we're evaluating out-of-sample

# In-sample fitted values from SARIMAX
y_pred = sarimax_results.fittedvalues

# Calculate MAPE
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(f'MAPE: {mape:.2f}%')
MAPE: 404.34%

Model Evaluation¶

(1) Train-Test Validation¶

In [659]:
# Split data into train and test sets (80-20 split)
train_size = int(len(y) * 0.8)
train_y, test_y = y.iloc[:train_size], y.iloc[train_size:]
train_exo, test_exo = exo.iloc[:train_size, :], exo.iloc[train_size:, :]

# Refit SARIMAX model on training data
sarimax_model_train = SARIMAX(train_y, 
                              exog=train_exo, 
                              order=optimal_order, 
                              seasonal_order=optimal_seasonal_order, 
                              enforce_stationarity=False, 
                              enforce_invertibility=False)

sarimax_results_train = sarimax_model_train.fit()

# Forecast on test set
Foods3_predictions = sarimax_results_train.forecast(steps=len(test_y), exog=test_exo)

(2) Check Distribution of MAPE¶

In [660]:
# Extract the actual and forecast values from the DataFrame
actual_values = test_y.values
forecast_values = Foods3_predictions

# Ensure both are one-dimensional arrays
actual_values = np.squeeze(actual_values)
forecast_values = np.squeeze(forecast_values)

# Calculate the MAPE for each row
mape_row_by_row = np.abs((actual_values - forecast_values) / actual_values) * 100

# Create a DataFrame to display the row-by-row MAPE along with the actual and forecast values
df_results = pd.DataFrame({
    'Actual': actual_values,
    'Forecast': forecast_values,
    'MAPE': mape_row_by_row
})

# Calculate the overall weighted MAPE
weights = actual_values / actual_values.sum()
overall_mape = np.sum(mape_row_by_row * weights)

# Display the DataFrame
print(df_results)

# Print the overall MAPE
print(f"Overall MAPE: {overall_mape:.2f}%")
                Actual      Forecast       MAPE
2015-03-14  20577.9210  20980.513741   1.956431
2015-03-15  24262.6775  21910.661523   9.693967
2015-03-16  16995.1786  17126.601868   0.773297
2015-03-17  14918.4433  15536.951373   4.145929
2015-03-18  13953.3844  15063.428980   7.955379
...                ...           ...        ...
2015-12-27  16427.9190  22474.081068  36.804187
2015-12-28  15311.9904  17717.676021  15.711123
2015-12-29  14675.1821  16146.088998  10.023091
2015-12-30  16201.4307  15724.033126   2.946638
2015-12-31  20203.9008  14867.015155  26.415125

[293 rows x 3 columns]
Overall MAPE: 8.66%

(3) Check Overestimate/Underestimate Forecasting & Predicted/Actual Pattern Match¶

In [661]:
# Plot predicted results 1
plt.figure(figsize=(12, 6))
plt.plot(test_y.index, test_y, label="Actual Revenue", color='blue', linewidth=2)
plt.plot(test_y.index, Foods3_predictions, label="Predicted Revenue", color='red', linestyle="dashed", linewidth=2)
plt.xlabel(" ")
plt.ylabel("Revenue")
plt.title("State Level (CA) Foods3 Department Revenue (2015-2016) Forecasting with 5 Major Events")
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image
In [662]:
# Plot predicted results 2
title = 'State Level (CA) Foods3 Department Revenue (2015-2016) Forecasting with 5 Major Events'
ylabel= 'Revenue'
xlabel= ''  # we don't really need a label here

ax = test_y.plot(legend=True, figsize=(12, 6), title=title)
Foods3_predictions.plot(legend=True, ax=ax)  # Make sure predictions plot on the same axis
ax.autoscale(axis='x', tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel)

# Set y-axis to start at zero
ax.set_ylim(bottom=0)  # This automatically adjusts the upper limit as needed

plt.show()
No description has been provided for this image

(4) Check Evaluation Metrics¶

In [663]:
# Compute evaluation metrics
mse = mean_squared_error(test_y, Foods3_predictions)
rmse = np.sqrt(mean_squared_error(test_y, Foods3_predictions))
mae = mean_absolute_error(test_y, Foods3_predictions)
mape = np.mean(np.abs((test_y - Foods3_predictions) / test_y)) * 100

# Display results
evaluation_results = (
    f"Mean Squared Error (MSE): {mse:.2f}\n"
    f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
    f"Mean Absolute Error (MAE): {mae:.2f}\n"
    f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)

print(evaluation_results)
Mean Squared Error (MSE): 6068119.54
Root Mean Squared Error (RMSE): 2463.36
Mean Absolute Error (MAE): 1553.20
Mean Absolute Percentage Error (MAPE): 525.80%
In [664]:
mape_value = mean_absolute_percentage_error(
    test_y,
    Foods3_predictions
) * 100  # sklearn version returns a decimal, so multiply by 100

print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 525.80%

2. Top-Down Approach¶

Recall Revenue Percentage by Store (All Time 2012/1/1-2015/12/31)¶

CA_1: 29%

CA_2: 17%

CA_3: 39%

CA_4: 15%

Since the above is store revenue % of all time, We need store revenue % of same forecast period last year (2014/03/14 -2014/12/31)¶

(Notice we shouldn't use store revenue % of actual forecast period, since we assume it is future data that we don't know currently, using that data will cause data leakage though we have it)¶

In [666]:
# 1. Define the date range
start_date = '2014-03-14'
end_date   = '2014-12-31'

# 2. Filter rows to the specified date range
df_filtered = Foods3_CA_df[
    (Foods3_CA_df['date'] >= start_date) &
    (Foods3_CA_df['date'] <= end_date)
]

# 3. Group by 'store_id' and sum 'revenue'
store_revenue_sums = df_filtered.groupby('store_id')['revenue'].sum()

# 4. Calculate the total revenue across all stores
total_revenue_period = store_revenue_sums.sum()

# 5. Calculate each store's share of total revenue
store_shares = store_revenue_sums / total_revenue_period

# 6. Display the results
store_shares.round(2)
Out[666]:
store_id
CA_1    0.29
CA_2    0.15
CA_3    0.41
CA_4    0.15
Name: revenue, dtype: float64

Revenue Percentage by Store (2014/3/14-2014/12/31)¶

CA_1: 29%

CA_2: 15%

CA_3: 41%

CA_4: 15%

(1) CA_1 Store Top-Down¶

  • Calculate CA_1 Predicted Revenue
In [667]:
# Extract CA_1 percentage
store_CA_1 = store_shares.loc['CA_1']

# Get CA_1 store revenue forecast
forecast_CA_1 = Foods3_predictions * store_CA_1
forecast_CA_1
Out[667]:
2015-03-14    6127.013528
2015-03-15    6398.647870
2015-03-16    5001.542032
2015-03-17    4537.310784
2015-03-18    4399.026367
                 ...     
2015-12-27    6563.185270
2015-12-28    5174.155505
2015-12-29    4715.199396
2015-12-30    4591.944929
2015-12-31    4341.666944
Freq: D, Name: predicted_mean, Length: 293, dtype: float64
  • Get CA_1 Actual Revenue
In [668]:
# 1. Filter for store_id == 'CA_1'
df_CA1 = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_1'].copy()

# 2. Restrict to desired date range (2015-03-14 to 2015-12-31)
start_date = '2015-03-14'
end_date   = '2015-12-31'
mask = (df_CA1['date'] >= start_date) & (df_CA1['date'] <= end_date)
df_CA1_filtered = df_CA1.loc[mask]

# 3. Select only 'date' and 'revenue', then sort by date
actual_CA_1 = df_CA1_filtered[['date', 'revenue']].sort_values('date')

# 4. Set date column as index
actual_CA_1.set_index('date', inplace=True)

# 5. Print time series of actual revenue
actual_CA_1
Out[668]:
revenue
date
2015-03-14 6141.6230
2015-03-15 7154.1210
2015-03-16 4833.7110
2015-03-17 3950.7646
2015-03-18 4107.2850
... ...
2015-12-27 4446.1396
2015-12-28 4131.3706
2015-12-29 4129.5820
2015-12-30 4345.4190
2015-12-31 5333.0360

293 rows × 1 columns

  • Plot CA_1 Actual VS Forecast
In [769]:
# Plot actual VS forecast
plt.figure(figsize=(10, 6))

plt.plot(actual_CA_1.index, actual_CA_1, label='Actual CA_1 Revenue')
plt.plot(forecast_CA_1.index, forecast_CA_1, label='Forecast CA_1 Revenue')

# Add labels and legend
plt.xlabel('')
plt.ylabel('Revenue')
plt.title('State Level (CA) CA_1 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events')
plt.legend()

# Show the plot
plt.show()
No description has been provided for this image
  • Evaluate CA_1 Top-Down Forecast Result
In [671]:
# Compute evaluation metrics
mse = mean_squared_error(actual_CA_1, forecast_CA_1)
rmse = np.sqrt(mse)
mae = mean_absolute_error(actual_CA_1, forecast_CA_1)
mape = np.mean(np.abs((actual_CA_1 - forecast_CA_1) / actual_CA_1)) * 100

# Display results
evaluation_results = (
    f"Mean Squared Error (MSE): {mse:.2f}\n"
    f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
    f"Mean Absolute Error (MAE): {mae:.2f}\n"
    f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)

print(evaluation_results)
Mean Squared Error (MSE): 674801.37
Root Mean Squared Error (RMSE): 821.46
Mean Absolute Error (MAE): 559.64
Mean Absolute Percentage Error (MAPE): nan%
In [672]:
mape_value = mean_absolute_percentage_error(
    actual_CA_1,
    forecast_CA_1
) * 100  # sklearn version returns a decimal, so multiply by 100

print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 7846370137560414208.00%

(2) CA_2 Store Top-Down¶

  • Calculate CA_2 Predicted Revenue
In [673]:
# Extract CA_2 percentage
store_CA_2 = store_shares.loc['CA_2']

# Get CA_2 store revenue forecast
forecast_CA_2 = Foods3_predictions * store_CA_2
forecast_CA_2
Out[673]:
2015-03-14    3196.456026
2015-03-15    3338.167355
2015-03-16    2609.298820
2015-03-17    2367.109903
2015-03-18    2294.967079
                 ...     
2015-12-27    3424.006330
2015-12-28    2699.351073
2015-12-29    2459.914191
2015-12-30    2395.612475
2015-12-31    2265.042733
Freq: D, Name: predicted_mean, Length: 293, dtype: float64
  • Get CA_2 Actual Revenue
In [674]:
# 1. Filter for store_id == 'CA_2'
df_CA2 = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_2'].copy()

# 2. Restrict to desired date range (2015-03-14 to 2015-12-31)
start_date = '2015-03-14'
end_date   = '2015-12-31'
mask = (df_CA2['date'] >= start_date) & (df_CA2['date'] <= end_date)
df_CA2_filtered = df_CA2.loc[mask]

# 3. Select only 'date' and 'revenue', then sort by date
actual_CA_2 = df_CA2_filtered[['date', 'revenue']].sort_values('date')

# 4. Set date column as index
actual_CA_2.set_index('date', inplace=True)

# 5. Print time series of actual revenue
actual_CA_2
Out[674]:
revenue
date
2015-03-14 3014.8530
2015-03-15 3557.4578
2015-03-16 1831.1394
2015-03-17 1876.0859
2015-03-18 1772.1809
... ...
2015-12-27 4265.5730
2015-12-28 3385.6113
2015-12-29 3380.6145
2015-12-30 3926.1920
2015-12-31 5141.0130

293 rows × 1 columns

  • Plot CA_2 Actual VS Forecast
In [818]:
# Plot actual VS forecast
plt.figure(figsize=(10, 6))

plt.plot(actual_CA_2.index, actual_CA_2, label='Actual CA_2 Revenue')
plt.plot(forecast_CA_2.index, forecast_CA_2, label='Forecast CA_2 Revenue')

# Add labels and legend
plt.xlabel('')
plt.ylabel('Revenue')
plt.title('State Level (CA) CA_2 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events')
plt.legend()

# Show the plot
plt.show()
No description has been provided for this image
  • Evaluate CA_2 Top-Down Forecast Result
In [676]:
# Compute evaluation metrics
mse = mean_squared_error(actual_CA_2, forecast_CA_2)
rmse = np.sqrt(mse)
mae = mean_absolute_error(actual_CA_2, forecast_CA_2)
mape = np.mean(np.abs((actual_CA_2 - forecast_CA_2) / actual_CA_2)) * 100

# Display results
evaluation_results = (
    f"Mean Squared Error (MSE): {mse:.2f}\n"
    f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
    f"Mean Absolute Error (MAE): {mae:.2f}\n"
    f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)

print(evaluation_results)
Mean Squared Error (MSE): 1572627.38
Root Mean Squared Error (RMSE): 1254.04
Mean Absolute Error (MAE): 1021.30
Mean Absolute Percentage Error (MAPE): nan%
In [677]:
mape_value = mean_absolute_percentage_error(
    actual_CA_2,
    forecast_CA_2
) * 100  # sklearn version returns a decimal, so multiply by 100

print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 281.42%

(3) CA_3 Store Top-Down¶

  • Calculate CA_3 Predicted Revenue
In [678]:
# Extract CA_3 percentage
store_CA_3 = store_shares.loc['CA_3']

# Get CA_3 store revenue forecast
forecast_CA_3 = Foods3_predictions * store_CA_3
forecast_CA_3
Out[678]:
2015-03-14    8538.410280
2015-03-15    8916.951219
2015-03-16    6969.989165
2015-03-17    6323.051330
2015-03-18    6130.342585
                 ...     
2015-12-27    9146.245282
2015-12-28    7210.537785
2015-12-29    6570.951216
2015-12-30    6399.187729
2015-12-31    6050.408327
Freq: D, Name: predicted_mean, Length: 293, dtype: float64
  • Get CA_3 Actual Revenue
In [679]:
# 1. Filter for store_id == 'CA_3'
df_CA3 = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_3'].copy()

# 2. Restrict to desired date range (2015-03-14 to 2015-12-31)
start_date = '2015-03-14'
end_date   = '2015-12-31'
mask = (df_CA3['date'] >= start_date) & (df_CA3['date'] <= end_date)
df_CA3_filtered = df_CA3.loc[mask]

# 3. Select only 'date' and 'revenue', then sort by date
actual_CA_3 = df_CA3_filtered[['date', 'revenue']].sort_values('date')

# 4. Set date column as index
actual_CA_3.set_index('date', inplace=True)

# 5. Print time series of actual revenue
actual_CA_3
Out[679]:
revenue
date
2015-03-14 8467.2440
2015-03-15 10116.2220
2015-03-16 7627.6895
2015-03-17 6562.2295
2015-03-18 5713.6250
... ...
2015-12-27 5824.9710
2015-12-28 5684.8975
2015-12-29 5417.0146
2015-12-30 5948.0480
2015-12-31 6914.9814

293 rows × 1 columns

  • Plot CA_3 Actual VS Forecast
In [792]:
# Plot actual VS forecast
plt.figure(figsize=(10, 6))

plt.plot(actual_CA_3.index, actual_CA_3, label='Actual CA_3 Revenue')
plt.plot(forecast_CA_3.index, forecast_CA_3, label='Forecast CA_3 Revenue')

# Add labels and legend
plt.xlabel('')
plt.ylabel('Revenue')
plt.title('State Level (CA) CA_3 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events')
plt.legend()

# Show the plot
plt.show()
No description has been provided for this image
  • Evaluate CA_3 Top-Down Forecast Result
In [681]:
# Compute evaluation metrics
mse = mean_squared_error(actual_CA_3, forecast_CA_3)
rmse = np.sqrt(mse)
mae = mean_absolute_error(actual_CA_3, forecast_CA_3)
mape = np.mean(np.abs((actual_CA_3 - forecast_CA_3) / actual_CA_3)) * 100

# Display results
evaluation_results = (
    f"Mean Squared Error (MSE): {mse:.2f}\n"
    f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
    f"Mean Absolute Error (MAE): {mae:.2f}\n"
    f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)

print(evaluation_results)
Mean Squared Error (MSE): 1469628.02
Root Mean Squared Error (RMSE): 1212.28
Mean Absolute Error (MAE): 819.96
Mean Absolute Percentage Error (MAPE): nan%
In [682]:
mape_value = mean_absolute_percentage_error(
    actual_CA_3,
    forecast_CA_3
) * 100  # sklearn version returns a decimal, so multiply by 100

print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 317.09%

(4) CA_4 Store Top-Down¶

  • Calculate CA_4 Predicted Revenue
In [683]:
# Extract CA_4 percentage
store_CA_4 = store_shares.loc['CA_4']

# Get CA_4 store revenue forecast
forecast_CA_4 = Foods3_predictions * store_CA_4
forecast_CA_4
Out[683]:
2015-03-14    3118.633907
2015-03-15    3256.895079
2015-03-16    2545.771851
2015-03-17    2309.479356
2015-03-18    2239.092948
                 ...     
2015-12-27    3340.644186
2015-12-28    2633.631658
2015-12-29    2400.024196
2015-12-30    2337.287993
2015-12-31    2209.897152
Freq: D, Name: predicted_mean, Length: 293, dtype: float64
  • Get CA_4 Actual Revenue
In [684]:
# 1. Filter for store_id == 'CA_4'
df_CA4 = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_4'].copy()

# 2. Restrict to desired date range (2015-03-14 to 2015-12-31)
start_date = '2015-03-14'
end_date   = '2015-12-31'
mask = (df_CA4['date'] >= start_date) & (df_CA4['date'] <= end_date)
df_CA4_filtered = df_CA4.loc[mask]

# 3. Select only 'date' and 'revenue', then sort by date
actual_CA_4 = df_CA4_filtered[['date', 'revenue']].sort_values('date')

# 4. Set date column as index
actual_CA_4.set_index('date', inplace=True)

# 5. Print time series of actual revenue
actual_CA_4
Out[684]:
revenue
date
2015-03-14 2954.2010
2015-03-15 3434.8767
2015-03-16 2702.6387
2015-03-17 2529.3633
2015-03-18 2360.2935
... ...
2015-12-27 1891.2354
2015-12-28 2110.1110
2015-12-29 1747.9710
2015-12-30 1981.7717
2015-12-31 2814.8704

293 rows × 1 columns

  • Plot CA_4 Actual VS Forecast
In [816]:
# Plot actual VS forecast
plt.figure(figsize=(10, 6))

plt.plot(actual_CA_4.index, actual_CA_4, label='Actual CA_4 Revenue')
plt.plot(forecast_CA_4.index, forecast_CA_4, label='Forecast CA_4 Revenue')

# Add labels and legend
plt.xlabel('')
plt.ylabel('Revenue')
plt.title('State Level (CA) CA_4 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events')
plt.legend()

# Show the plot
plt.show()
No description has been provided for this image
  • Evaluate CA_4 Top-Down Forecast Result
In [686]:
# Compute evaluation metrics
mse = mean_squared_error(actual_CA_4, forecast_CA_4)
rmse = np.sqrt(mse)
mae = mean_absolute_error(actual_CA_4, forecast_CA_4)
mape = np.mean(np.abs((actual_CA_4 - forecast_CA_4) / actual_CA_4)) * 100

# Display results
evaluation_results = (
    f"Mean Squared Error (MSE): {mse:.2f}\n"
    f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
    f"Mean Absolute Error (MAE): {mae:.2f}\n"
    f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)

print(evaluation_results)
Mean Squared Error (MSE): 193354.44
Root Mean Squared Error (RMSE): 439.72
Mean Absolute Error (MAE): 291.09
Mean Absolute Percentage Error (MAPE): nan%
In [687]:
mape_value = mean_absolute_percentage_error(
    actual_CA_4,
    forecast_CA_4
) * 100  # sklearn version returns a decimal, so multiply by 100

print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 3993781937241965568.00%

3. CA_1, CA_2, CA_3, CA_4 Store Revenue Forecasting¶

(1) CA_1 Store Revenue Forecast¶

EDA¶

In [688]:
# Filter for store_id == 'CA_1'
CA_1_daily = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_1'].copy()
CA_1_daily
Out[688]:
date store_id unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay week month year
1348 2012-01-01 CA_1 1448 1.645839 3213.3958 1.0 0 0 1 0 0 2011-12-26/2012-01-01 2012-01 2012
1352 2012-01-02 CA_1 2001 1.645839 4684.9550 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
1359 2012-01-03 CA_1 1711 1.645839 3916.0360 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
1363 2012-01-04 CA_1 1598 1.645839 3606.1812 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
1364 2012-01-05 CA_1 1949 1.645839 4310.1980 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7172 2015-12-27 CA_1 1669 2.921556 4446.1396 0.0 0 0 0 0 0 2015-12-21/2015-12-27 2015-12 2015
7176 2015-12-28 CA_1 1597 2.921556 4131.3706 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7180 2015-12-29 CA_1 1579 2.921556 4129.5820 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7186 2015-12-30 CA_1 1582 2.921556 4345.4190 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7189 2015-12-31 CA_1 1940 2.921556 5333.0360 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015

1461 rows × 14 columns

In [689]:
# Plot daily revenue trend
plt.figure(figsize=(12,6))
plt.plot(CA_1_daily['date'], CA_1_daily['revenue'], label='Daily Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('CA_1 Store Daily Revenue Trend Over Time')
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image
In [690]:
# Plot histograms for each feature (Ignore first plot - date column)
CA_1_daily.iloc[:, 1:].hist(bins=30, figsize=(12, 10))
plt.tight_layout()
plt.show()
No description has been provided for this image
In [691]:
CA_1_daily.describe()
Out[691]:
date unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
count 1461 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000
mean 2013-12-31 00:00:00 2160.395619 2.418859 5031.478264 0.328542 0.002738 0.002738 0.002738 0.002738 0.002738
min 2012-01-01 00:00:00 0.000000 1.645839 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2012-12-31 00:00:00 1755.000000 2.025482 4131.370600 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 2013-12-31 00:00:00 2067.000000 2.425668 4798.596700 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 2014-12-31 00:00:00 2515.000000 2.784499 5833.535000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 2015-12-31 00:00:00 4042.000000 2.921895 8927.261000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
std NaN 536.165123 0.399183 1251.038288 0.469844 0.052271 0.052271 0.052271 0.052271 0.052271
In [692]:
CA_1_daily.shape
Out[692]:
(1461, 14)
In [693]:
# Apply decomposition (Additive model for clearer seasonality)
decomposition = seasonal_decompose(CA_1_daily.set_index('date')['revenue'], model='additive', period=365)

# Plot decomposition results
plt.figure(figsize=(12, 8))
fig = decomposition.plot()
fig.axes[0].set_title("")
plt.suptitle(f"Seasonal Decomposition of Revenue - CA_1 Store", fontsize=14)
plt.show()
<Figure size 1200x800 with 0 Axes>
No description has been provided for this image
In [694]:
# Extract seasonal component
seasonality_strength = np.var(decomposition.seasonal) / np.var(CA_1_daily['revenue'])

# Display the seasonality strength
seasonality_strength
Out[694]:
0.5651099818207891

Roughly 57% of the variance in CA_1 store revenue series is explained by seasonality, indicating a strong degree of seasonality¶

In [695]:
# Autocorrelation and Partial Autocorrelation plots
plt.figure(figsize=(12,4))
plot_acf(CA_1_daily['revenue'], lags=50, zero=False, ax=plt.gca())  # Autocorrelation from lag 1
plt.title('Autocorrelation of Revenue (starting from lag 1)')
plt.show()

plt.figure(figsize=(12,4))
plot_pacf(CA_1_daily['revenue'], lags=50, zero=False, ax=plt.gca())  # Partial Autocorrelation from lag 1
plt.title('Partial Autocorrelation of Revenue (starting from lag 1)')
plt.show()
No description has been provided for this image
No description has been provided for this image

PACF is p or AR, it is pretty obvious that p is 1

ACF is q or MA, it is not obvious, need auto ARIMA to further determine q

Begin Modeling¶

  • Apply ADF test to check stationarity
In [696]:
# Automate Augmented Dickey-Fuller (ADF) test for stationarity
adf_test(CA_1_daily['revenue'])
Augmented Dickey-Fuller Test: 
ADF test statistic        -3.828987
p-value                    0.002624
# lags used               24.000000
# observations          1436.000000
critical value (1%)       -3.434912
critical value (5%)       -2.863555
critical value (10%)      -2.567843
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary

Since it is a stationary data, we don't need to difference it to achieve stationary, skip the below differencing steps

In [262]:
# # Apply first-order differencing
# CA_1_daily['revenue_diff'] = CA_1_daily['revenue'].diff()

# # Perform ADF test again on the differenced series
# adf_test(CA_1_daily['revenue_diff'])
Augmented Dickey-Fuller Test: 
ADF test statistic       -20.042742
p-value                    0.000000
# lags used               26.000000
# observations          1913.000000
critical value (1%)       -3.433773
critical value (5%)       -2.863052
critical value (10%)      -2.567575
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
In [263]:
# # Plot the differenced revenue
# plt.figure(figsize=(12,6))
# plt.plot(CA_1_daily['date'], CA_1_daily['revenue_diff'], label='Differenced Revenue', color='purple')
# plt.xlabel('')
# plt.ylabel('Differenced Revenue')
# plt.title('First-Order Differencing of Revenue to Achieve Stationarity')
# plt.legend()
# plt.grid(True)
# plt.show()
No description has been provided for this image
  • Use auto ARIMA to determine optimal parameters (p, d, q). Since p=1 from PACF, we force p=1 in auto ARIMA to reduce computing time
In [697]:
# Define the endogenous (target) variable and exogenous variables
y = CA_1_daily.set_index('date')['revenue']
exo = CA_1_daily.set_index('date')[['event1_Christmas', 'event1_Thanksgiving', 'event1_NewYear', 'event1_SuperBowl', 'event1_LaborDay']]

# Use Auto ARIMA to determine optimal SARIMAX parameters
auto_arima_model = pm.auto_arima(
    y, 
    exogenous=exo,
    stationary=True,  # Tells auto_arima that the series is already stationary
    seasonal=True, 
    m=7,  # Assuming weekly seasonality
    start_p=1,    # Minimum AR order
    max_p=1,      # Maximum AR order (forces p = 1)
    trace=True, 
    error_action='ignore', 
    suppress_warnings=True, 
    stepwise=True
)

# Extract the best parameters
optimal_order = auto_arima_model.order
optimal_seasonal_order = auto_arima_model.seasonal_order

optimal_order, optimal_seasonal_order
Performing stepwise search to minimize aic
 ARIMA(1,0,2)(1,0,1)[7] intercept   : AIC=23614.447, Time=3.76 sec
 ARIMA(0,0,0)(0,0,0)[7] intercept   : AIC=24988.051, Time=0.03 sec
 ARIMA(1,0,0)(1,0,0)[7] intercept   : AIC=23557.927, Time=1.56 sec
 ARIMA(0,0,1)(0,0,1)[7] intercept   : AIC=23932.189, Time=0.67 sec
 ARIMA(0,0,0)(0,0,0)[7]             : AIC=29141.298, Time=0.02 sec
 ARIMA(1,0,0)(0,0,0)[7] intercept   : AIC=24466.946, Time=0.07 sec
 ARIMA(1,0,0)(2,0,0)[7] intercept   : AIC=23701.182, Time=4.73 sec
 ARIMA(1,0,0)(1,0,1)[7] intercept   : AIC=23706.897, Time=1.61 sec
 ARIMA(1,0,0)(0,0,1)[7] intercept   : AIC=23947.127, Time=0.66 sec
 ARIMA(1,0,0)(2,0,1)[7] intercept   : AIC=23733.032, Time=3.63 sec
 ARIMA(0,0,0)(1,0,0)[7] intercept   : AIC=23667.791, Time=1.20 sec
 ARIMA(1,0,1)(1,0,0)[7] intercept   : AIC=inf, Time=1.43 sec
 ARIMA(0,0,1)(1,0,0)[7] intercept   : AIC=23452.630, Time=1.45 sec
 ARIMA(0,0,1)(0,0,0)[7] intercept   : AIC=24363.978, Time=0.13 sec
 ARIMA(0,0,1)(2,0,0)[7] intercept   : AIC=24289.360, Time=2.22 sec
 ARIMA(0,0,1)(1,0,1)[7] intercept   : AIC=23912.490, Time=1.67 sec
 ARIMA(0,0,1)(2,0,1)[7] intercept   : AIC=23837.642, Time=3.70 sec
 ARIMA(0,0,2)(1,0,0)[7] intercept   : AIC=24139.575, Time=1.44 sec
 ARIMA(1,0,2)(1,0,0)[7] intercept   : AIC=23392.921, Time=1.64 sec
 ARIMA(1,0,2)(0,0,0)[7] intercept   : AIC=24310.794, Time=0.28 sec
 ARIMA(1,0,2)(2,0,0)[7] intercept   : AIC=inf, Time=3.60 sec
 ARIMA(1,0,2)(0,0,1)[7] intercept   : AIC=23902.845, Time=1.31 sec
 ARIMA(1,0,2)(2,0,1)[7] intercept   : AIC=23590.297, Time=4.73 sec
 ARIMA(1,0,3)(1,0,0)[7] intercept   : AIC=23480.472, Time=2.31 sec
 ARIMA(0,0,3)(1,0,0)[7] intercept   : AIC=24344.688, Time=2.06 sec
 ARIMA(1,0,2)(1,0,0)[7]             : AIC=inf, Time=1.50 sec

Best model:  ARIMA(1,0,2)(1,0,0)[7] intercept
Total fit time: 47.429 seconds
Out[697]:
((1, 0, 2), (1, 0, 0, 7))
In [698]:
# Applying AUTO_ARIMA model to get p and q automatically
auto_arima_model.summary()
Out[698]:
SARIMAX Results
Dep. Variable: y No. Observations: 1461
Model: SARIMAX(1, 0, 2)x(1, 0, [], 7) Log Likelihood -11690.461
Date: Wed, 12 Mar 2025 AIC 23392.921
Time: 19:59:10 BIC 23424.643
Sample: 01-01-2012 HQIC 23404.754
- 12-31-2015
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 2791.5405 182.763 15.274 0.000 2433.331 3149.750
ar.L1 -0.9064 0.027 -33.836 0.000 -0.959 -0.854
ma.L1 1.3494 0.030 45.041 0.000 1.291 1.408
ma.L2 0.4637 0.019 23.914 0.000 0.426 0.502
ar.S.L7 0.7090 0.018 39.561 0.000 0.674 0.744
sigma2 5.203e+05 1.16e+04 44.873 0.000 4.98e+05 5.43e+05
Ljung-Box (L1) (Q): 0.15 Jarque-Bera (JB): 862.43
Prob(Q): 0.70 Prob(JB): 0.00
Heteroskedasticity (H): 1.10 Skew: -0.16
Prob(H) (two-sided): 0.30 Kurtosis: 6.75


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
  • Apply auto ARIMA results to fit SARIMAX model
In [699]:
# Fit the SARIMAX model
sarimax_model = SARIMAX(y, 
                        exog=exo, 
                        order=optimal_order, 
                        seasonal_order=optimal_seasonal_order, 
                        enforce_stationarity=False, 
                        enforce_invertibility=False)

sarimax_results = sarimax_model.fit()

# Display model summary
sarimax_results.summary()
Out[699]:
SARIMAX Results
Dep. Variable: revenue No. Observations: 1461
Model: SARIMAX(1, 0, 2)x(1, 0, [], 7) Log Likelihood -12127.716
Date: Wed, 12 Mar 2025 AIC 24275.431
Time: 19:59:16 BIC 24328.245
Sample: 01-01-2012 HQIC 24295.137
- 12-31-2015
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
event1_Christmas -7.325e-05 185.983 -3.94e-07 1.000 -364.520 364.519
event1_Thanksgiving 3108.4775 182.275 17.054 0.000 2751.226 3465.729
event1_NewYear 3386.6197 362.733 9.336 0.000 2675.676 4097.563
event1_SuperBowl 6335.3692 157.486 40.228 0.000 6026.701 6644.037
event1_LaborDay 6846.0455 168.361 40.663 0.000 6516.065 7176.026
ar.L1 1.0000 0.000 2806.764 0.000 0.999 1.001
ma.L1 -0.7424 0.041 -18.029 0.000 -0.823 -0.662
ma.L2 -0.2460 0.041 -5.989 0.000 -0.326 -0.165
ar.S.L7 0.6160 0.026 23.809 0.000 0.565 0.667
sigma2 1.615e+06 0.772 2.09e+06 0.000 1.61e+06 1.61e+06
Ljung-Box (L1) (Q): 0.54 Jarque-Bera (JB): 5580.17
Prob(Q): 0.46 Prob(JB): 0.00
Heteroskedasticity (H): 0.92 Skew: -0.81
Prob(H) (two-sided): 0.35 Kurtosis: 12.46


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 7.27e+20. Standard errors may be unstable.
In [700]:
# Plot diagnostic plots (residuals, etc.)
sarimax_results.plot_diagnostics(figsize=(15, 12))
plt.show()
No description has been provided for this image
  • Plot actual and predicted trend
In [701]:
plt.figure(figsize=(12, 6))
plt.plot(y.index, y, label='Actual', marker='o')
plt.plot(y.index, sarimax_results.fittedvalues, label='Fitted', marker='x', color='red')
plt.title('Actual vs Fitted Values')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.legend()
plt.show()
No description has been provided for this image
  • Evaluate predicted result using MAPE
In [702]:
# Actual values (endogenous variable)
y_true = y  # or test set if we're evaluating out-of-sample

# In-sample fitted values from SARIMAX
y_pred = sarimax_results.fittedvalues

# Calculate MAPE
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(f'MAPE: {mape:.2f}%')
MAPE: inf%
In [703]:
mape_value = mean_absolute_percentage_error(
    y_true,
    y_pred
) * 100  # sklearn version returns a decimal, so multiply by 100

print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 5529745410436780032.00%
In [704]:
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

# Display results
evaluation_results = (
    f"Mean Squared Error (MSE): {mse:.2f}\n"
    f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
    f"Mean Absolute Error (MAE): {mae:.2f}\n"
)

print(evaluation_results)
Mean Squared Error (MSE): 925908.46
Root Mean Squared Error (RMSE): 962.24
Mean Absolute Error (MAE): 641.76

Model Evaluation¶

  • Train-Test Validation
In [705]:
# Split data into train and test sets (80-20 split)
train_size = int(len(y) * 0.8)
train_y, test_y = y.iloc[:train_size], y.iloc[train_size:]
train_exo, test_exo = exo.iloc[:train_size, :], exo.iloc[train_size:, :]

# Refit SARIMAX model on training data
sarimax_model_train = SARIMAX(train_y, 
                              exog=train_exo, 
                              order=optimal_order, 
                              seasonal_order=optimal_seasonal_order, 
                              enforce_stationarity=False, 
                              enforce_invertibility=False)

sarimax_results_train = sarimax_model_train.fit()

# Forecast on test set
CA_1_predictions = sarimax_results_train.forecast(steps=len(test_y), exog=test_exo)
  • Check Distribution of MAPE
In [706]:
# Extract the actual and forecast values from the DataFrame
actual_values = test_y.values
forecast_values = CA_1_predictions

# Ensure both are one-dimensional arrays
actual_values = np.squeeze(actual_values)
forecast_values = np.squeeze(forecast_values)

# Calculate the MAPE for each row
mape_row_by_row = np.abs((actual_values - forecast_values) / actual_values) * 100

# Create a DataFrame to display the row-by-row MAPE along with the actual and forecast values
df_results = pd.DataFrame({
    'Actual': actual_values,
    'Forecast': forecast_values,
    'MAPE': mape_row_by_row
})

# Calculate the overall weighted MAPE
weights = actual_values / actual_values.sum()
overall_mape = np.sum(mape_row_by_row * weights)

# Display the DataFrame
print(df_results)

# Print the overall MAPE
print(f"Overall MAPE: {overall_mape:.2f}%")
               Actual     Forecast       MAPE
2015-03-14  6141.6230  6154.238994   0.205418
2015-03-15  7154.1210  6606.418043   7.655769
2015-03-16  4833.7110  5115.085114   5.821079
2015-03-17  3950.7646  4712.327626  19.276345
2015-03-18  4107.2850  4622.632162  12.547149
...               ...          ...        ...
2015-12-27  4446.1396  5172.092356  16.327709
2015-12-28  4131.3706  5172.186530  25.192994
2015-12-29  4129.5820  5172.280705  25.249498
2015-12-30  4345.4190  5172.374883  19.030521
2015-12-31  5333.0360  5172.469062   3.010798

[293 rows x 3 columns]
Overall MAPE: 18.14%
  • Check Overestimate/Underestimate Forecasting & Predicted/Actual Pattern Match
In [711]:
# Plot predicted results
plt.figure(figsize=(12, 6))
plt.plot(test_y.index, test_y, label="Actual Revenue", color='blue', linewidth=2)
plt.plot(test_y.index, CA_1_predictions, label="Predicted Revenue", color='red', linestyle="dashed", linewidth=2)
plt.xlabel("")
plt.ylabel("Revenue")
plt.title("CA_1 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events")
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image
  • Check Evaluation Metrics
In [712]:
# Compute evaluation metrics
mse = mean_squared_error(test_y, CA_1_predictions)
rmse = np.sqrt(mean_squared_error(test_y, CA_1_predictions))
mae = mean_absolute_error(test_y, CA_1_predictions)
mape = np.mean(np.abs((test_y - CA_1_predictions) / test_y)) * 100

# Display results
evaluation_results = (
    f"Mean Squared Error (MSE): {mse:.2f}\n"
    f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
    f"Mean Absolute Error (MAE): {mae:.2f}\n"
    f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)

print(evaluation_results)
Mean Squared Error (MSE): 1434183.47
Root Mean Squared Error (RMSE): 1197.57
Mean Absolute Error (MAE): 943.87
Mean Absolute Percentage Error (MAPE): inf%
In [713]:
mape_value = mean_absolute_percentage_error(
    test_y,
    CA_1_predictions
) * 100  # sklearn version returns a decimal, so multiply by 100

print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 7949551095629456384.00%

Comparison of CA_1 Forecast Plots with Evaluation Metrics¶

CA_1 Top_Down Forecast CA_1 Direct Forecast
No description has been provided for this image
MSE: 674801
RMSE: 821
MAE: 560
MAPE: infinite
No description has been provided for this image
MSE: 1434183
RMSE: 1198
MAE: 944
MAPE: infinite

Conclusion: Top-down approach is recommended in CA_1 Store Revenue Forecast

(2) CA_2 Store Revenue Forecast¶

EDA¶

In [827]:
# Filter for store_id == 'CA_2'
CA_2_daily = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_2'].copy()
CA_2_daily
Out[827]:
date store_id unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay week month year
1351 2012-01-01 CA_2 941 1.473136 1994.1145 1.0 0 0 1 0 0 2011-12-26/2012-01-01 2012-01 2012
1353 2012-01-02 CA_2 1364 1.473136 2982.6143 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
1358 2012-01-03 CA_2 939 1.473136 1945.9517 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
1362 2012-01-04 CA_2 853 1.473136 1775.3049 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
1365 2012-01-05 CA_2 971 1.473136 2079.2812 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7173 2015-12-27 CA_2 1627 2.923421 4265.5730 0.0 0 0 0 0 0 2015-12-21/2015-12-27 2015-12 2015
7177 2015-12-28 CA_2 1369 2.923421 3385.6113 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7181 2015-12-29 CA_2 1406 2.923421 3380.6145 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7187 2015-12-30 CA_2 1625 2.923421 3926.1920 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7190 2015-12-31 CA_2 2087 2.923421 5141.0130 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015

1461 rows × 14 columns

In [828]:
# Plot daily revenue trend
plt.figure(figsize=(12,6))
plt.plot(CA_2_daily['date'], CA_2_daily['revenue'], label='Daily Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('CA_2 Store Daily Revenue Trend Over Time')
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image
In [829]:
# Plot histograms for each feature (Skip first plot - date column)
CA_2_daily.iloc[:, 1:].hist(bins=30, figsize=(12, 10))
plt.tight_layout()
plt.show()
No description has been provided for this image
In [830]:
CA_2_daily.describe()
Out[830]:
date unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
count 1461 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000
mean 2013-12-31 00:00:00 1277.983573 2.278975 2887.853205 0.328542 0.002738 0.002738 0.002738 0.002738 0.002738
min 2012-01-01 00:00:00 2.000000 1.473136 3.160156 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2012-12-31 00:00:00 1018.000000 1.817200 2294.994400 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 2013-12-31 00:00:00 1204.000000 2.281236 2688.572000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 2014-12-31 00:00:00 1494.000000 2.747719 3392.054700 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 2015-12-31 00:00:00 2867.000000 2.924174 6952.808000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
std NaN 364.599095 0.469592 836.461332 0.469844 0.052271 0.052271 0.052271 0.052271 0.052271
In [831]:
CA_2_daily.shape
Out[831]:
(1461, 14)
In [832]:
# Apply decomposition (Additive model for clearer seasonality)
decomposition = seasonal_decompose(CA_2_daily.set_index('date')['revenue'], model='additive', period=365)

# Plot decomposition results
plt.figure(figsize=(12, 8))
fig = decomposition.plot()
fig.axes[0].set_title("")
plt.suptitle(f"Seasonal Decomposition of Revenue - CA_2 Store", fontsize=14)
plt.show()
<Figure size 1200x800 with 0 Axes>
No description has been provided for this image
In [833]:
# Extract seasonal component
seasonality_strength = np.var(decomposition.seasonal) / np.var(CA_2_daily['revenue'])

# Display the seasonality strength
seasonality_strength
Out[833]:
0.300318352234158

Roughly 30% of the variance in CA_2 store revenue series is explained by seasonality, indicating a moderate degree of seasonality¶

In [834]:
# Autocorrelation and Partial Autocorrelation plots
plt.figure(figsize=(12,4))
plot_acf(CA_2_daily['revenue'], lags=50, zero=False, ax=plt.gca())  # Autocorrelation from lag 1
plt.title('Autocorrelation of Revenue (starting from lag 1)')
plt.show()

plt.figure(figsize=(12,4))
plot_pacf(CA_2_daily['revenue'], lags=50, zero=False, ax=plt.gca())  # Partial Autocorrelation from lag 1
plt.title('Partial Autocorrelation of Revenue (starting from lag 1)')
plt.show()
No description has been provided for this image
No description has been provided for this image

PACF is p or AR, it is pretty obvious that p is 1

ACF is q or MA, it is not obvious, need auto ARIMA to further determine q

Begin Modeling¶

  • Apply ADF test to check stationarity
In [835]:
# Automate Augmented Dickey-Fuller (ADF) test for stationarity
adf_test(CA_2_daily['revenue'])
Augmented Dickey-Fuller Test: 
ADF test statistic        -1.379162
p-value                    0.592225
# lags used               22.000000
# observations          1438.000000
critical value (1%)       -3.434906
critical value (5%)       -2.863552
critical value (10%)      -2.567841
Weak evidence against the null hypothesis
Fail to reject the null hypothesis
Data has a unit root and is non-stationary

Since it is a non-stationary data, we need to difference it to achieve stationary

In [836]:
# Apply first-order differencing
CA_2_daily['revenue_diff'] = CA_2_daily['revenue'].diff()

# Perform ADF test again on the differenced series
adf_test(CA_2_daily['revenue_diff'])
Augmented Dickey-Fuller Test: 
ADF test statistic     -1.122857e+01
p-value                 1.932892e-20
# lags used             2.100000e+01
# observations          1.438000e+03
critical value (1%)    -3.434906e+00
critical value (5%)    -2.863552e+00
critical value (10%)   -2.567841e+00
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary

Since first-order differencing makes data stationary, d = 1

In [837]:
# Plot the differenced revenue
plt.figure(figsize=(12,6))
plt.plot(CA_2_daily['date'], CA_2_daily['revenue_diff'], label='Differenced Revenue', color='purple')
plt.xlabel('')
plt.ylabel('Differenced Revenue')
plt.title('First-Order Differencing of Revenue to Achieve Stationarity')
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image
  • Use auto ARIMA to determine optimal parameters (p, d, q). Since p=1 from PACF, we force p=1 in auto ARIMA to reduce computing time
In [838]:
# Define the endogenous (target) variable and exogenous variables
y = CA_2_daily.set_index('date')['revenue']
exo = CA_2_daily.set_index('date')[['event1_Christmas', 'event1_Thanksgiving', 'event1_NewYear', 'event1_SuperBowl', 'event1_LaborDay']]

# Use Auto ARIMA to determine optimal SARIMAX parameters
auto_arima_model = pm.auto_arima(
    y, 
    exogenous=exo, 
    seasonal=True, 
    m=7,  # Assuming weekly seasonality
    start_p=1,    # Minimum AR order
    max_p=1,      # Maximum AR order (forces p = 1)
    trace=True, 
    error_action='ignore', 
    suppress_warnings=True, 
    stepwise=True
)

# Extract the best parameters
optimal_order = auto_arima_model.order
optimal_seasonal_order = auto_arima_model.seasonal_order

optimal_order, optimal_seasonal_order
Performing stepwise search to minimize aic
 ARIMA(1,1,2)(1,0,1)[7] intercept   : AIC=21328.553, Time=3.30 sec
 ARIMA(0,1,0)(0,0,0)[7] intercept   : AIC=23384.032, Time=0.04 sec
 ARIMA(1,1,0)(1,0,0)[7] intercept   : AIC=22187.029, Time=1.15 sec
 ARIMA(0,1,1)(0,0,1)[7] intercept   : AIC=22612.359, Time=1.24 sec
 ARIMA(0,1,0)(0,0,0)[7]             : AIC=23382.045, Time=0.02 sec
 ARIMA(1,1,2)(0,0,1)[7] intercept   : AIC=22380.623, Time=2.23 sec
 ARIMA(1,1,2)(1,0,0)[7] intercept   : AIC=inf, Time=1.98 sec
 ARIMA(1,1,2)(2,0,1)[7] intercept   : AIC=inf, Time=4.02 sec
 ARIMA(1,1,2)(1,0,2)[7] intercept   : AIC=21282.002, Time=4.56 sec
 ARIMA(1,1,2)(0,0,2)[7] intercept   : AIC=22056.336, Time=2.36 sec
 ARIMA(1,1,2)(2,0,2)[7] intercept   : AIC=21329.344, Time=5.22 sec
 ARIMA(0,1,2)(1,0,2)[7] intercept   : AIC=inf, Time=4.44 sec
 ARIMA(1,1,1)(1,0,2)[7] intercept   : AIC=inf, Time=4.80 sec
 ARIMA(1,1,3)(1,0,2)[7] intercept   : AIC=inf, Time=6.27 sec
 ARIMA(0,1,1)(1,0,2)[7] intercept   : AIC=inf, Time=3.42 sec
 ARIMA(0,1,3)(1,0,2)[7] intercept   : AIC=21794.315, Time=6.28 sec
 ARIMA(1,1,2)(1,0,2)[7]             : AIC=21295.310, Time=2.90 sec

Best model:  ARIMA(1,1,2)(1,0,2)[7] intercept
Total fit time: 54.256 seconds
Out[838]:
((1, 1, 2), (1, 0, 2, 7))
In [839]:
# Applying AUTO_ARIMA model to get p and q automatically
auto_arima_model.summary()
Out[839]:
SARIMAX Results
Dep. Variable: y No. Observations: 1461
Model: SARIMAX(1, 1, 2)x(1, 0, 2, 7) Log Likelihood -10633.001
Date: Wed, 12 Mar 2025 AIC 21282.002
Time: 22:18:04 BIC 21324.292
Sample: 01-01-2012 HQIC 21297.778
- 12-31-2015
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept -0.0028 0.024 -0.117 0.907 -0.049 0.044
ar.L1 0.3974 0.112 3.558 0.000 0.178 0.616
ma.L1 -1.1292 0.122 -9.292 0.000 -1.367 -0.891
ma.L2 0.1766 0.110 1.598 0.110 -0.040 0.393
ar.S.L7 0.9989 0.001 1219.709 0.000 0.997 1.000
ma.S.L7 -0.8375 0.028 -29.874 0.000 -0.892 -0.783
ma.S.L14 0.0244 0.026 0.936 0.349 -0.027 0.076
sigma2 1.231e+05 2094.390 58.775 0.000 1.19e+05 1.27e+05
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 19295.16
Prob(Q): 0.95 Prob(JB): 0.00
Heteroskedasticity (H): 2.23 Skew: -1.39
Prob(H) (two-sided): 0.00 Kurtosis: 20.59


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).


  • Apply auto ARIMA results to fit SARIMAX model
In [840]:
# Fit the SARIMAX model
sarimax_model = SARIMAX(y, 
                        exog=exo, 
                        order=optimal_order, 
                        seasonal_order=optimal_seasonal_order, 
                        enforce_stationarity=False, 
                        enforce_invertibility=False)

sarimax_results = sarimax_model.fit()

# Display model summary
sarimax_results.summary()
Out[840]:
SARIMAX Results
Dep. Variable: revenue No. Observations: 1461
Model: SARIMAX(1, 1, 2)x(1, 0, 2, 7) Log Likelihood -10298.829
Date: Wed, 12 Mar 2025 AIC 20621.658
Time: 22:18:15 BIC 20684.951
Sample: 01-01-2012 HQIC 20645.283
- 12-31-2015
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
event1_Christmas -2800.2065 61.289 -45.689 0.000 -2920.330 -2680.083
event1_Thanksgiving -1268.2217 118.327 -10.718 0.000 -1500.138 -1036.305
event1_NewYear -655.0949 313.174 -2.092 0.036 -1268.906 -41.284
event1_SuperBowl 250.1420 79.756 3.136 0.002 93.824 406.460
event1_LaborDay 706.7567 91.645 7.712 0.000 527.136 886.377
ar.L1 0.3004 0.075 4.008 0.000 0.153 0.447
ma.L1 -0.9097 0.081 -11.264 0.000 -1.068 -0.751
ma.L2 -0.0314 0.071 -0.440 0.660 -0.171 0.109
ar.S.L7 0.9932 0.003 286.705 0.000 0.986 1.000
ma.S.L7 -0.8231 0.027 -30.807 0.000 -0.875 -0.771
ma.S.L14 0.0538 0.025 2.178 0.029 0.005 0.102
sigma2 9.2e+04 2531.910 36.335 0.000 8.7e+04 9.7e+04
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 316.66
Prob(Q): 0.98 Prob(JB): 0.00
Heteroskedasticity (H): 2.09 Skew: 0.13
Prob(H) (two-sided): 0.00 Kurtosis: 5.28


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).


In [841]:
# Plot diagnostic plots (residuals, etc.)
sarimax_results.plot_diagnostics(figsize=(15, 12))
plt.show()
No description has been provided for this image
  • Plot actual and predicted trend
In [842]:
plt.figure(figsize=(12, 6))
plt.plot(y.index, y, label='Actual', marker='o')
plt.plot(y.index, sarimax_results.fittedvalues, label='Fitted', marker='x', color='red')
plt.title('Actual vs Fitted Values')
plt.xlabel('')
plt.ylabel('Revenue')
plt.legend()
plt.show()
No description has been provided for this image
  • Evaluate predicted result using MAPE
In [843]:
# Actual values (endogenous variable)
y_true = y  # or test set if we're evaluating out-of-sample

# In-sample fitted values from SARIMAX
y_pred = sarimax_results.fittedvalues

# Calculate MAPE
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(f'MAPE: {mape:.2f}%')
MAPE: 56.82%
In [844]:
mape_value = mean_absolute_percentage_error(
    y_true,
    y_pred
) * 100  # sklearn version returns a decimal, so multiply by 100

print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 56.82%
In [845]:
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

# Display results
evaluation_results = (
    f"Mean Squared Error (MSE): {mse:.2f}\n"
    f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
    f"Mean Absolute Error (MAE): {mae:.2f}\n"
)

print(evaluation_results)
Mean Squared Error (MSE): 101133.29
Root Mean Squared Error (RMSE): 318.01
Mean Absolute Error (MAE): 231.67

Model Evaluation¶

  • Train-Test Validation
In [846]:
# Split data into train and test sets (80-20 split)
train_size = int(len(y) * 0.8)
train_y, test_y = y.iloc[:train_size], y.iloc[train_size:]
train_exo, test_exo = exo.iloc[:train_size, :], exo.iloc[train_size:, :]

# Refit SARIMAX model on training data
sarimax_model_train = SARIMAX(train_y, 
                              exog=train_exo, 
                              order=optimal_order, 
                              seasonal_order=optimal_seasonal_order, 
                              enforce_stationarity=False, 
                              enforce_invertibility=False)

sarimax_results_train = sarimax_model_train.fit()

# Forecast on test set
CA_2_predictions = sarimax_results_train.forecast(steps=len(test_y), exog=test_exo)

Since forecast has very poor accuracy, we manually tune SARIMAX parameters below but still not see improvement. Therefore, we will proceed with auto ARIMA result above¶

In [847]:
# # Split data into train and test sets (80-20 split)
# train_size = int(len(y) * 0.8)
# train_y, test_y = y.iloc[:train_size], y.iloc[train_size:]
# train_exo, test_exo = exo.iloc[:train_size, :], exo.iloc[train_size:, :]

# # Refit SARIMAX model on training data
# sarimax_model_train = SARIMAX(train_y, 
#                               exog=train_exo, 
#                               order=(5, 1, 5), 
#                               seasonal_order=(2, 0, 1, 7), 
#                               enforce_stationarity=False, 
#                               enforce_invertibility=False)

# sarimax_results_train = sarimax_model_train.fit()

# # Forecast on test set
# CA_2_predictions = sarimax_results_train.forecast(steps=len(test_y), exog=test_exo)
  • Check Distribution of MAPE
In [848]:
# Extract the actual and forecast values from the DataFrame
actual_values = test_y.values
forecast_values = CA_2_predictions

# Ensure both are one-dimensional arrays
actual_values = np.squeeze(actual_values)
forecast_values = np.squeeze(forecast_values)

# Calculate the MAPE for each row
mape_row_by_row = np.abs((actual_values - forecast_values) / actual_values) * 100

# Create a DataFrame to display the row-by-row MAPE along with the actual and forecast values
df_results = pd.DataFrame({
    'Actual': actual_values,
    'Forecast': forecast_values,
    'MAPE': mape_row_by_row
})

# Calculate the overall weighted MAPE
weights = actual_values / actual_values.sum()
overall_mape = np.sum(mape_row_by_row * weights)

# Display the DataFrame
print(df_results)

# Print the overall MAPE
print(f"Overall MAPE: {overall_mape:.2f}%")
               Actual     Forecast       MAPE
2015-03-14  3014.8530  2932.156404   2.742973
2015-03-15  3557.4578  2979.284514  16.252429
2015-03-16  1831.1394  1870.941040   2.173600
2015-03-17  1876.0859  1762.571225   6.050612
2015-03-18  1772.1809  1739.843169   1.824742
...               ...          ...        ...
2015-12-27  4265.5730  2089.978382  51.003573
2015-12-28  3385.6113  1035.757091  69.407088
2015-12-29  3380.6145   931.965251  72.432076
2015-12-30  3926.1920   913.284018  76.738682
2015-12-31  5141.0130   985.625108  80.828193

[293 rows x 3 columns]
Overall MAPE: 51.03%
  • Check Overestimate/Underestimate Forecasting & Predicted/Actual Pattern Match
In [849]:
# Plot predicted results
plt.figure(figsize=(12, 6))
plt.plot(test_y.index, test_y, label="Actual Revenue", color='blue', linewidth=2)
plt.plot(test_y.index, CA_2_predictions, label="Predicted Revenue", color='red', linestyle="dashed", linewidth=2)
plt.xlabel("")
plt.ylabel("Revenue")
plt.title("CA_2 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events")
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image
  • Check Evaluation Metrics
In [850]:
# Compute evaluation metrics
mse = mean_squared_error(test_y, CA_2_predictions)
rmse = np.sqrt(mean_squared_error(test_y, CA_2_predictions))
mae = mean_absolute_error(test_y, CA_2_predictions)
mape = np.mean(np.abs((test_y - CA_2_predictions) / test_y)) * 100

# Display results
evaluation_results = (
    f"Mean Squared Error (MSE): {mse:.2f}\n"
    f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
    f"Mean Absolute Error (MAE): {mae:.2f}\n"
    f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)

print(evaluation_results)
Mean Squared Error (MSE): 4436671.37
Root Mean Squared Error (RMSE): 2106.34
Mean Absolute Error (MAE): 1791.30
Mean Absolute Percentage Error (MAPE): 145.45%
In [851]:
mape_value = mean_absolute_percentage_error(
    test_y,
    CA_2_predictions
) * 100  # sklearn version returns a decimal, so multiply by 100

print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 145.45%
In [852]:
# Plot actual train data
plt.figure(figsize=(12, 6))
plt.plot(train_y.index, train_y, label="Actual Revenue", color='blue', linewidth=2)
plt.xlabel("")
plt.ylabel("Revenue")
plt.title("train_actual")
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image

To investigate the reason why the forecast is too off, we plot train data above. It could be the train data bias from 2015-01 to 2015-03, since in that period the data trend decays, causing our forecast trend also go downward and not catch actual rising trend from 2015-03 to 2015-12¶

Comparison of CA_2 Forecast Plots with Evaluation Metrics¶

CA_2 Top_Down Forecast CA_2 Direct Forecast
No description has been provided for this image
MSE: 1572627
RMSE: 1254
MAE: 1021
MAPE: 281%
No description has been provided for this image
MSE: 4436671
RMSE: 2106
MAE: 1791
MAPE: 145%

Conclusion: Either statistical approach is not recommended in CA_2 Store Revenue Forecast, suggesting ML & DL approaches

(3) CA_3 Store Revenue Forecast¶

EDA¶

In [770]:
# Filter for store_id == 'CA_3'
CA_3_daily = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_3'].copy()
CA_3_daily
Out[770]:
date store_id unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay week month year
1349 2012-01-01 CA_3 2051 1.630491 4246.9053 1.0 0 0 1 0 0 2011-12-26/2012-01-01 2012-01 2012
1354 2012-01-02 CA_3 3185 1.630491 6694.9766 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
1357 2012-01-03 CA_3 2842 1.630491 5886.2085 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
1361 2012-01-04 CA_3 2569 1.630491 5463.7550 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
1366 2012-01-05 CA_3 2645 1.630491 5683.1963 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7174 2015-12-27 CA_3 2195 2.875478 5824.9710 0.0 0 0 0 0 0 2015-12-21/2015-12-27 2015-12 2015
7178 2015-12-28 CA_3 2205 2.875478 5684.8975 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7182 2015-12-29 CA_3 2053 2.875478 5417.0146 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7185 2015-12-30 CA_3 2269 2.875478 5948.0480 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7191 2015-12-31 CA_3 2682 2.875478 6914.9814 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015

1461 rows × 14 columns

In [771]:
# Plot daily revenue trend
plt.figure(figsize=(12,6))
plt.plot(CA_3_daily['date'], CA_3_daily['revenue'], label='Daily Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('CA_3 Store Daily Revenue Trend Over Time')
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image
In [772]:
# Plot histograms for each feature (skip first column - date)
CA_3_daily.iloc[:, 1:].hist(bins=30, figsize=(12, 10))
plt.tight_layout()
plt.show()
No description has been provided for this image
In [773]:
CA_3_daily.describe()
Out[773]:
date unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
count 1461 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000
mean 2013-12-31 00:00:00 3065.451745 2.387430 6731.888187 0.328542 0.002738 0.002738 0.002738 0.002738 0.002738
min 2012-01-01 00:00:00 0.000000 1.630491 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2012-12-31 00:00:00 2629.000000 1.988541 5855.423000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 2013-12-31 00:00:00 2992.000000 2.396947 6596.341000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 2014-12-31 00:00:00 3453.000000 2.762711 7539.036600 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 2015-12-31 00:00:00 5118.000000 2.900847 11199.707000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
std NaN 618.825955 0.400079 1320.156196 0.469844 0.052271 0.052271 0.052271 0.052271 0.052271
In [774]:
CA_3_daily.shape
Out[774]:
(1461, 14)
In [775]:
# Apply decomposition (Additive model for clearer seasonality)
decomposition = seasonal_decompose(CA_3_daily.set_index('date')['revenue'], model='additive', period=365)

# Plot decomposition results
plt.figure(figsize=(12, 8))
fig = decomposition.plot()
fig.axes[0].set_title("")
plt.suptitle(f"Seasonal Decomposition of Revenue - CA_3 Store", fontsize=14)
plt.show()
<Figure size 1200x800 with 0 Axes>
No description has been provided for this image
In [776]:
# Extract seasonal component
seasonality_strength = np.var(decomposition.seasonal) / np.var(CA_3_daily['revenue'])

# Display the seasonality strength
seasonality_strength
Out[776]:
0.5534479211812071

Roughly 55% of the variance in CA_3 store revenue series is explained by seasonality, indicating a strong degree of seasonality¶

In [777]:
# Autocorrelation and Partial Autocorrelation plots
plt.figure(figsize=(12,4))
plot_acf(CA_3_daily['revenue'], lags=50, zero=False, ax=plt.gca())  # Autocorrelation from lag 1
plt.title('Autocorrelation of Revenue (starting from lag 1)')
plt.show()

plt.figure(figsize=(12,4))
plot_pacf(CA_3_daily['revenue'], lags=50, zero=False, ax=plt.gca())  # Partial Autocorrelation from lag 1
plt.title('Partial Autocorrelation of Revenue (starting from lag 1)')
plt.show()
No description has been provided for this image
No description has been provided for this image

PACF is p or AR, it is pretty obvious that p is 1

ACF is q or MA, it is not obvious, need auto ARIMA to further determine q

Begin Modeling¶

  • Apply ADF test to check stationarity
In [778]:
# Automate Augmented Dickey-Fuller (ADF) test for stationarity
adf_test(CA_3_daily['revenue'])
Augmented Dickey-Fuller Test: 
ADF test statistic        -3.028698
p-value                    0.032300
# lags used               24.000000
# observations          1436.000000
critical value (1%)       -3.434912
critical value (5%)       -2.863555
critical value (10%)      -2.567843
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary

Since it is a stationary data, we don't need to difference it to achieve stationary, skip the below differencing steps

In [369]:
# # Apply first-order differencing
# CA_3_daily['revenue_diff'] = CA_3_daily['revenue'].diff()

# # Perform ADF test again on the differenced series
# adf_test(CA_3_daily['revenue_diff'])
Augmented Dickey-Fuller Test: 
ADF test statistic       -19.862482
p-value                    0.000000
# lags used               26.000000
# observations          1913.000000
critical value (1%)       -3.433773
critical value (5%)       -2.863052
critical value (10%)      -2.567575
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
In [370]:
# # Plot the differenced revenue
# plt.figure(figsize=(12,6))
# plt.plot(CA_3_daily['date'], CA_3_daily['revenue_diff'], label='Differenced Revenue', color='purple')
# plt.xlabel('')
# plt.ylabel('Differenced Revenue')
# plt.title('First-Order Differencing of Revenue to Achieve Stationarity')
# plt.legend()
# plt.grid(True)
# plt.show()
No description has been provided for this image
  • Use auto ARIMA to determine optimal parameters (p, d, q). Since p=1 from PACF, we force p=1 in auto ARIMA to reduce computing time
In [779]:
# Define the endogenous (target) variable and exogenous variables
y = CA_3_daily.set_index('date')['revenue']
exo = CA_3_daily.set_index('date')[['event1_Christmas', 'event1_Thanksgiving', 'event1_NewYear', 'event1_SuperBowl', 'event1_LaborDay']]

# Use Auto ARIMA to determine optimal SARIMAX parameters
auto_arima_model = pm.auto_arima(
    y, 
    exogenous=exo,
    stationary=True,  # Tells auto_arima that the series is already stationary
    seasonal=True, 
    m=7,  # Assuming weekly seasonality
    start_p=1,    # Minimum AR order
    max_p=1,      # Maximum AR order (forces p = 1)
    trace=True, 
    error_action='ignore', 
    suppress_warnings=True, 
    stepwise=True
)

# Extract the best parameters
optimal_order = auto_arima_model.order
optimal_seasonal_order = auto_arima_model.seasonal_order

optimal_order, optimal_seasonal_order
Performing stepwise search to minimize aic
 ARIMA(1,0,2)(1,0,1)[7] intercept   : AIC=25449.209, Time=2.39 sec
 ARIMA(0,0,0)(0,0,0)[7] intercept   : AIC=25145.185, Time=0.03 sec
 ARIMA(1,0,0)(1,0,0)[7] intercept   : AIC=23993.565, Time=0.66 sec
 ARIMA(0,0,1)(0,0,1)[7] intercept   : AIC=24155.931, Time=0.72 sec
 ARIMA(0,0,0)(0,0,0)[7]             : AIC=29959.527, Time=0.02 sec
 ARIMA(1,0,0)(0,0,0)[7] intercept   : AIC=24375.603, Time=0.07 sec
 ARIMA(1,0,0)(2,0,0)[7] intercept   : AIC=23902.358, Time=3.44 sec
 ARIMA(1,0,0)(2,0,1)[7] intercept   : AIC=inf, Time=3.96 sec
 ARIMA(1,0,0)(1,0,1)[7] intercept   : AIC=23913.835, Time=1.39 sec
 ARIMA(0,0,0)(2,0,0)[7] intercept   : AIC=25101.957, Time=2.65 sec
 ARIMA(1,0,1)(2,0,0)[7] intercept   : AIC=23920.590, Time=3.75 sec
 ARIMA(0,0,1)(2,0,0)[7] intercept   : AIC=24497.755, Time=2.64 sec
 ARIMA(1,0,0)(2,0,0)[7]             : AIC=inf, Time=1.04 sec

Best model:  ARIMA(1,0,0)(2,0,0)[7] intercept
Total fit time: 22.768 seconds
Out[779]:
((1, 0, 0), (2, 0, 0, 7))
In [780]:
# Applying AUTO_ARIMA model to get p and q automatically
auto_arima_model.summary()
Out[780]:
SARIMAX Results
Dep. Variable: y No. Observations: 1461
Model: SARIMAX(1, 0, 0)x(2, 0, 0, 7) Log Likelihood -11946.179
Date: Wed, 12 Mar 2025 AIC 23902.358
Time: 20:54:42 BIC 23928.793
Sample: 01-01-2012 HQIC 23912.219
- 12-31-2015
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 2391.1829 112.571 21.242 0.000 2170.549 2611.817
ar.L1 0.3721 0.019 19.382 0.000 0.334 0.410
ar.S.L7 0.3209 0.027 11.940 0.000 0.268 0.374
ar.S.L14 0.1263 0.024 5.338 0.000 0.080 0.173
sigma2 7.29e+05 1.62e+04 44.903 0.000 6.97e+05 7.61e+05
Ljung-Box (L1) (Q): 80.94 Jarque-Bera (JB): 1863.03
Prob(Q): 0.00 Prob(JB): 0.00
Heteroskedasticity (H): 0.99 Skew: -0.43
Prob(H) (two-sided): 0.91 Kurtosis: 8.46


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).


  • Apply auto ARIMA results to fit SARIMAX model
In [781]:
# Fit the SARIMAX model
sarimax_model = SARIMAX(y, 
                        exog=exo, 
                        order=optimal_order, 
                        seasonal_order=optimal_seasonal_order, 
                        enforce_stationarity=False, 
                        enforce_invertibility=False)

sarimax_results = sarimax_model.fit()

# Display model summary
sarimax_results.summary()
Out[781]:
SARIMAX Results
Dep. Variable: revenue No. Observations: 1461
Model: SARIMAX(1, 0, 0)x(2, 0, 0, 7) Log Likelihood -12338.703
Date: Wed, 12 Mar 2025 AIC 24695.406
Time: 20:54:53 BIC 24742.895
Sample: 01-01-2012 HQIC 24713.130
- 12-31-2015
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
event1_Christmas 5.1350 243.896 0.021 0.983 -472.893 483.163
event1_Thanksgiving 4482.3930 242.378 18.493 0.000 4007.342 4957.444
event1_NewYear 4298.8950 325.920 13.190 0.000 3660.103 4937.687
event1_SuperBowl 8513.2784 261.749 32.525 0.000 8000.260 9026.297
event1_LaborDay 8554.2418 246.461 34.708 0.000 8071.187 9037.297
ar.L1 0.3581 0.023 15.638 0.000 0.313 0.403
ar.S.L7 0.5383 0.020 27.345 0.000 0.500 0.577
ar.S.L14 0.4366 0.018 24.206 0.000 0.401 0.472
sigma2 2.114e+06 1.25e+05 16.958 0.000 1.87e+06 2.36e+06
Ljung-Box (L1) (Q): 6.89 Jarque-Bera (JB): 7805.20
Prob(Q): 0.01 Prob(JB): 0.00
Heteroskedasticity (H): 0.90 Skew: -0.84
Prob(H) (two-sided): 0.22 Kurtosis: 14.26


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).


In [782]:
# Plot diagnostic plots (residuals, etc.)
sarimax_results.plot_diagnostics(figsize=(15, 12))
plt.show()
No description has been provided for this image
  • Plot actual and predicted trend
In [783]:
plt.figure(figsize=(12, 6))
plt.plot(y.index, y, label='Actual', marker='o')
plt.plot(y.index, sarimax_results.fittedvalues, label='Fitted', marker='x', color='red')
plt.title('Actual vs Fitted Values')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.legend()
plt.show()
No description has been provided for this image
  • Evaluate predicted result using MAPE
In [784]:
# Actual values (endogenous variable)
y_true = y  # or test set if we're evaluating out-of-sample

# In-sample fitted values from SARIMAX
y_pred = sarimax_results.fittedvalues

# Calculate MAPE
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(f'MAPE: {mape:.2f}%')
MAPE: inf%
In [785]:
mape_value = mean_absolute_percentage_error(
    y_true,
    y_pred
) * 100  # sklearn version returns a decimal, so multiply by 100

print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 1391159110222060544.00%
In [786]:
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

# Display results
evaluation_results = (
    f"Mean Squared Error (MSE): {mse:.2f}\n"
    f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
    f"Mean Absolute Error (MAE): {mae:.2f}\n"
)

print(evaluation_results)
Mean Squared Error (MSE): 1504035.74
Root Mean Squared Error (RMSE): 1226.39
Mean Absolute Error (MAE): 768.68

Model Evaluation¶

  • Train-Test Validation
In [787]:
# Split data into train and test sets (80-20 split)
train_size = int(len(y) * 0.8)
train_y, test_y = y.iloc[:train_size], y.iloc[train_size:]
train_exo, test_exo = exo.iloc[:train_size, :], exo.iloc[train_size:, :]

# Refit SARIMAX model on training data
sarimax_model_train = SARIMAX(train_y, 
                              exog=train_exo, 
                              order=optimal_order, 
                              seasonal_order=optimal_seasonal_order, 
                              enforce_stationarity=False, 
                              enforce_invertibility=False)

sarimax_results_train = sarimax_model_train.fit()

# Forecast on test set
CA_3_predictions = sarimax_results_train.forecast(steps=len(test_y), exog=test_exo)
  • Check Distribution of MAPE
In [788]:
# Extract the actual and forecast values from the DataFrame
actual_values = test_y.values
forecast_values = CA_3_predictions

# Ensure both are one-dimensional arrays
actual_values = np.squeeze(actual_values)
forecast_values = np.squeeze(forecast_values)

# Calculate the MAPE for each row
mape_row_by_row = np.abs((actual_values - forecast_values) / actual_values) * 100

# Create a DataFrame to display the row-by-row MAPE along with the actual and forecast values
df_results = pd.DataFrame({
    'Actual': actual_values,
    'Forecast': forecast_values,
    'MAPE': mape_row_by_row
})

# Calculate the overall weighted MAPE
weights = actual_values / actual_values.sum()
overall_mape = np.sum(mape_row_by_row * weights)

# Display the DataFrame
print(df_results)

# Print the overall MAPE
print(f"Overall MAPE: {overall_mape:.2f}%")
                Actual     Forecast       MAPE
2015-03-14   8467.2440  8462.797537   0.052514
2015-03-15  10116.2220  9585.442269   5.246818
2015-03-16   7627.6895  7536.943879   1.189687
2015-03-17   6562.2295  6866.026313   4.629476
2015-03-18   5713.6250  6065.148979   6.152381
...                ...          ...        ...
2015-12-27   5824.9710  4784.281592  17.866002
2015-12-28   5684.8975  3769.431435  33.693942
2015-12-29   5417.0146  3390.932751  37.402185
2015-12-30   5948.0480  2988.907192  49.749780
2015-12-31   6914.9814  3168.309109  54.181958

[293 rows x 3 columns]
Overall MAPE: 23.73%
  • Check Overestimate/Underestimate Forecasting & Predicted/Actual Pattern Match
In [793]:
# Plot predicted results
plt.figure(figsize=(12, 6))
plt.plot(test_y.index, test_y, label="Actual Revenue", color='blue', linewidth=2)
plt.plot(test_y.index, CA_3_predictions, label="Predicted Revenue", color='red', linestyle="dashed", linewidth=2)
plt.xlabel("")
plt.ylabel("Revenue")
plt.title("CA_3 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events")
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image
  • Check Evaluation Metrics
In [790]:
# Compute evaluation metrics
mse = mean_squared_error(test_y, CA_3_predictions)
rmse = np.sqrt(mean_squared_error(test_y, CA_3_predictions))
mae = mean_absolute_error(test_y, CA_3_predictions)
mape = np.mean(np.abs((test_y - CA_3_predictions) / test_y)) * 100

# Display results
evaluation_results = (
    f"Mean Squared Error (MSE): {mse:.2f}\n"
    f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
    f"Mean Absolute Error (MAE): {mae:.2f}\n"
    f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)

print(evaluation_results)
Mean Squared Error (MSE): 3547322.56
Root Mean Squared Error (RMSE): 1883.43
Mean Absolute Error (MAE): 1603.13
Mean Absolute Percentage Error (MAPE): 164.26%
In [791]:
mape_value = mean_absolute_percentage_error(
    test_y,
    CA_3_predictions
) * 100  # sklearn version returns a decimal, so multiply by 100

print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 164.26%

Comparison of CA_3 Forecast Plots with Evaluation Metrics¶

CA_3 Top_Down Forecast CA_3 Direct Forecast
No description has been provided for this image
MSE: 1469628
RMSE: 1212
MAE: 820
MAPE: 317%
No description has been provided for this image
MSE: 3547323
RMSE: 1883
MAE: 1603
MAPE: 164%

Conclusion: Top-down approach is recommended in CA_3 Store Revenue Forecast

(4) CA_4 Store Revenue Forecast¶

EDA¶

In [794]:
# Filter for store_id == 'CA_4'
CA_4_daily = Foods3_CA_df[Foods3_CA_df['store_id'] == 'CA_4'].copy()
CA_4_daily
Out[794]:
date store_id unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay week month year
1350 2012-01-01 CA_4 738 1.589748 1637.3416 1.0 0 0 1 0 0 2011-12-26/2012-01-01 2012-01 2012
1355 2012-01-02 CA_4 967 1.589748 2332.8342 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
1356 2012-01-03 CA_4 876 1.589748 1856.1882 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
1360 2012-01-04 CA_4 728 1.589748 1592.6340 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
1367 2012-01-05 CA_4 936 1.589748 2176.2485 1.0 0 0 0 0 0 2012-01-02/2012-01-08 2012-01 2012
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7175 2015-12-27 CA_4 721 2.924935 1891.2354 0.0 0 0 0 0 0 2015-12-21/2015-12-27 2015-12 2015
7179 2015-12-28 CA_4 753 2.924935 2110.1110 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7183 2015-12-29 CA_4 642 2.924935 1747.9710 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7184 2015-12-30 CA_4 713 2.924935 1981.7717 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015
7188 2015-12-31 CA_4 1072 2.924935 2814.8704 0.0 0 0 0 0 0 2015-12-28/2016-01-03 2015-12 2015

1461 rows × 14 columns

In [795]:
# Plot daily revenue trend
plt.figure(figsize=(12,6))
plt.plot(CA_4_daily['date'], CA_4_daily['revenue'], label='Daily Revenue', linewidth=1)
plt.xlabel('')
plt.ylabel('Revenue ($)')
plt.title('CA_4 Store Daily Revenue Trend Over Time')
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image
In [796]:
# Plot histograms for each feature
CA_4_daily.iloc[:, 1:].hist(bins=30, figsize=(12, 10))
plt.tight_layout()
plt.show()
No description has been provided for this image
In [797]:
CA_4_daily.describe()
Out[797]:
date unit_sold sell_price revenue snap_CA event1_Christmas event1_Thanksgiving event1_NewYear event1_SuperBowl event1_LaborDay
count 1461 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000 1461.000000
mean 2013-12-31 00:00:00 1066.749487 2.379939 2534.858355 0.328542 0.002738 0.002738 0.002738 0.002738 0.002738
min 2012-01-01 00:00:00 0.000000 1.589748 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2012-12-31 00:00:00 942.000000 1.969490 2245.784200 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 2013-12-31 00:00:00 1053.000000 2.371497 2496.722000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 2014-12-31 00:00:00 1181.000000 2.760504 2800.404800 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 2015-12-31 00:00:00 1822.000000 2.925345 3987.775400 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
std NaN 187.604346 0.415896 441.667311 0.469844 0.052271 0.052271 0.052271 0.052271 0.052271
In [798]:
CA_4_daily.shape
Out[798]:
(1461, 14)
In [799]:
# Apply decomposition (Additive model for clearer seasonality)
decomposition = seasonal_decompose(CA_4_daily.set_index('date')['revenue'], model='additive', period=365)

# Plot decomposition results
plt.figure(figsize=(12, 8))
fig = decomposition.plot()
fig.axes[0].set_title("")
plt.suptitle(f"Seasonal Decomposition of Revenue - CA_4 Store", fontsize=14)
plt.show()
<Figure size 1200x800 with 0 Axes>
No description has been provided for this image
In [800]:
# Extract seasonal component
seasonality_strength = np.var(decomposition.seasonal) / np.var(CA_4_daily['revenue'])

# Display the seasonality strength
seasonality_strength
Out[800]:
0.5383630531175937

Roughly 54% of the variance in CA_4 store revenue series is explained by seasonality, indicating a strong degree of seasonality¶

In [801]:
# Autocorrelation and Partial Autocorrelation plots
plt.figure(figsize=(12,4))
plot_acf(CA_4_daily['revenue'], lags=50, zero=False, ax=plt.gca())  # Autocorrelation from lag 1
plt.title('Autocorrelation of Revenue (starting from lag 1)')
plt.show()

plt.figure(figsize=(12,4))
plot_pacf(CA_4_daily['revenue'], lags=50, zero=False, ax=plt.gca())  # Partial Autocorrelation from lag 1
plt.title('Partial Autocorrelation of Revenue (starting from lag 1)')
plt.show()
No description has been provided for this image
No description has been provided for this image

PACF is p or AR, it is pretty obvious that p is 1

ACF is q or MA, it is not obvious, need auto ARIMA to further determine q

Begin Modeling¶

  • Apply ADF test to check stationarity
In [802]:
# Automate Augmented Dickey-Fuller (ADF) test for stationarity
adf_test(CA_4_daily['revenue'])
Augmented Dickey-Fuller Test: 
ADF test statistic        -3.215335
p-value                    0.019114
# lags used               21.000000
# observations          1439.000000
critical value (1%)       -3.434902
critical value (5%)       -2.863551
critical value (10%)      -2.567840
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary

Since it is a stationary data, we don't need to difference it to achieve stationary, skip the below differencing steps

In [464]:
# # Apply first-order differencing
# CA_4_daily['revenue_diff'] = CA_4_daily['revenue'].diff()

# # Perform ADF test again on the differenced series
# adf_test(CA_4_daily['revenue_diff'])
Augmented Dickey-Fuller Test: 
ADF test statistic     -1.542777e+01
p-value                 2.976135e-28
# lags used             2.600000e+01
# observations          1.913000e+03
critical value (1%)    -3.433773e+00
critical value (5%)    -2.863052e+00
critical value (10%)   -2.567575e+00
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary
In [465]:
# # Plot the differenced revenue
# plt.figure(figsize=(12,6))
# plt.plot(CA_4_daily['date'], CA_4_daily['revenue_diff'], label='Differenced Revenue', color='purple')
# plt.xlabel('')
# plt.ylabel('Differenced Revenue')
# plt.title('First-Order Differencing of Revenue to Achieve Stationarity')
# plt.legend()
# plt.grid(True)
# plt.show()
No description has been provided for this image
  • Use auto ARIMA to determine optimal parameters (p, d, q). Since p=1 from PACF, we force p=1 in auto ARIMA to reduce computing time
In [803]:
# Define the endogenous (target) variable and exogenous variables
y = CA_4_daily.set_index('date')['revenue']
exo = CA_4_daily.set_index('date')[['event1_Christmas', 'event1_Thanksgiving', 'event1_NewYear', 'event1_SuperBowl', 'event1_LaborDay']]

# Use Auto ARIMA to determine optimal SARIMAX parameters
auto_arima_model = pm.auto_arima(
    y, 
    exogenous=exo,
    stationary=True,  # Tells auto_arima that the series is already stationary
    seasonal=True, 
    m=7,  # Assuming weekly seasonality
    start_p=1,    # Minimum AR order
    max_p=1,      # Maximum AR order (forces p = 1)
    trace=True, 
    error_action='ignore', 
    suppress_warnings=True, 
    stepwise=True
)

# Extract the best parameters
optimal_order = auto_arima_model.order
optimal_seasonal_order = auto_arima_model.seasonal_order

optimal_order, optimal_seasonal_order
Performing stepwise search to minimize aic
 ARIMA(1,0,2)(1,0,1)[7] intercept   : AIC=20999.389, Time=3.93 sec
 ARIMA(0,0,0)(0,0,0)[7] intercept   : AIC=21945.745, Time=0.04 sec
 ARIMA(1,0,0)(1,0,0)[7] intercept   : AIC=21094.341, Time=1.80 sec
 ARIMA(0,0,1)(0,0,1)[7] intercept   : AIC=21274.526, Time=0.67 sec
 ARIMA(0,0,0)(0,0,0)[7]             : AIC=27094.127, Time=0.02 sec
 ARIMA(1,0,2)(0,0,1)[7] intercept   : AIC=21257.274, Time=1.16 sec
 ARIMA(1,0,2)(1,0,0)[7] intercept   : AIC=21125.368, Time=2.68 sec
 ARIMA(1,0,2)(2,0,1)[7] intercept   : AIC=inf, Time=4.00 sec
 ARIMA(1,0,2)(1,0,2)[7] intercept   : AIC=inf, Time=4.85 sec
 ARIMA(1,0,2)(0,0,0)[7] intercept   : AIC=21451.657, Time=0.34 sec
 ARIMA(1,0,2)(0,0,2)[7] intercept   : AIC=21146.087, Time=2.17 sec
 ARIMA(1,0,2)(2,0,0)[7] intercept   : AIC=20941.810, Time=3.92 sec
 ARIMA(0,0,2)(2,0,0)[7] intercept   : AIC=21402.372, Time=3.36 sec
 ARIMA(1,0,1)(2,0,0)[7] intercept   : AIC=21065.812, Time=3.81 sec
 ARIMA(1,0,3)(2,0,0)[7] intercept   : AIC=20946.273, Time=3.89 sec
 ARIMA(0,0,1)(2,0,0)[7] intercept   : AIC=21471.782, Time=2.71 sec
 ARIMA(0,0,3)(2,0,0)[7] intercept   : AIC=21411.703, Time=3.79 sec
 ARIMA(1,0,2)(2,0,0)[7]             : AIC=21083.673, Time=1.92 sec

Best model:  ARIMA(1,0,2)(2,0,0)[7] intercept
Total fit time: 45.072 seconds
Out[803]:
((1, 0, 2), (2, 0, 0, 7))
In [804]:
# Applying AUTO_ARIMA model to get p and q automatically
auto_arima_model.summary()
Out[804]:
SARIMAX Results
Dep. Variable: y No. Observations: 1461
Model: SARIMAX(1, 0, 2)x(2, 0, [], 7) Log Likelihood -10463.905
Date: Wed, 12 Mar 2025 AIC 20941.810
Time: 21:06:11 BIC 20978.818
Sample: 01-01-2012 HQIC 20955.615
- 12-31-2015
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 1054.5207 178.457 5.909 0.000 704.752 1404.290
ar.L1 -0.4687 0.240 -1.951 0.051 -0.940 0.002
ma.L1 0.7601 0.237 3.204 0.001 0.295 1.225
ma.L2 0.2058 0.059 3.490 0.000 0.090 0.321
ar.S.L7 0.3668 0.025 14.580 0.000 0.317 0.416
ar.S.L14 0.3505 0.019 18.499 0.000 0.313 0.388
sigma2 9.615e+04 1777.793 54.083 0.000 9.27e+04 9.96e+04
Ljung-Box (L1) (Q): 0.27 Jarque-Bera (JB): 3203.75
Prob(Q): 0.60 Prob(JB): 0.00
Heteroskedasticity (H): 1.26 Skew: -0.51
Prob(H) (two-sided): 0.01 Kurtosis: 10.18


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).


  • Apply auto ARIMA results to fit SARIMAX model
In [805]:
# Fit the SARIMAX model
sarimax_model = SARIMAX(y, 
                        exog=exo, 
                        order=optimal_order, 
                        seasonal_order=optimal_seasonal_order, 
                        enforce_stationarity=False, 
                        enforce_invertibility=False)

sarimax_results = sarimax_model.fit()

# Display model summary
sarimax_results.summary()
Out[805]:
SARIMAX Results
Dep. Variable: revenue No. Observations: 1461
Model: SARIMAX(1, 0, 2)x(2, 0, [], 7) Log Likelihood -10808.235
Date: Wed, 12 Mar 2025 AIC 21638.469
Time: 21:06:19 BIC 21696.512
Sample: 01-01-2012 HQIC 21660.132
- 12-31-2015
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
event1_Christmas -0.0001 59.112 -2.35e-06 1.000 -115.858 115.857
event1_Thanksgiving 1845.9065 58.709 31.442 0.000 1730.839 1960.974
event1_NewYear 1672.3256 98.250 17.021 0.000 1479.758 1864.893
event1_SuperBowl 3269.5215 58.107 56.267 0.000 3155.634 3383.409
event1_LaborDay 3200.6348 51.379 62.294 0.000 3099.933 3301.337
ar.L1 0.9999 0.000 2246.155 0.000 0.999 1.001
ma.L1 -0.8346 0.033 -25.069 0.000 -0.900 -0.769
ma.L2 -0.1393 0.035 -4.001 0.000 -0.207 -0.071
ar.S.L7 0.2917 0.024 11.993 0.000 0.244 0.339
ar.S.L14 0.2332 0.028 8.420 0.000 0.179 0.287
sigma2 2.412e+05 0.934 2.58e+05 0.000 2.41e+05 2.41e+05
Ljung-Box (L1) (Q): 0.03 Jarque-Bera (JB): 13698.30
Prob(Q): 0.86 Prob(JB): 0.00
Heteroskedasticity (H): 1.02 Skew: -2.09
Prob(H) (two-sided): 0.85 Kurtosis: 17.49


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 5.12e+18. Standard errors may be unstable.



In [806]:
# Plot diagnostic plots (residuals, etc.)
sarimax_results.plot_diagnostics(figsize=(15, 12))
plt.show()
No description has been provided for this image
  • Plot actual and predicted trend
In [807]:
plt.figure(figsize=(12, 6))
plt.plot(y.index, y, label='Actual', marker='o')
plt.plot(y.index, sarimax_results.fittedvalues, label='Fitted', marker='x', color='red')
plt.title('Actual vs Fitted Values')
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.legend()
plt.show()
No description has been provided for this image
  • Evaluate predicted result using MAPE
In [808]:
# Actual values (endogenous variable)
y_true = y  # or test set if we're evaluating out-of-sample

# In-sample fitted values from SARIMAX
y_pred = sarimax_results.fittedvalues

# Calculate MAPE
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(f'MAPE: {mape:.2f}%')
MAPE: inf%
In [809]:
mape_value = mean_absolute_percentage_error(
    y_true,
    y_pred
) * 100  # sklearn version returns a decimal, so multiply by 100

print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 2760581763077460992.00%
In [810]:
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

# Display results
evaluation_results = (
    f"Mean Squared Error (MSE): {mse:.2f}\n"
    f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
    f"Mean Absolute Error (MAE): {mae:.2f}\n"
)

print(evaluation_results)
Mean Squared Error (MSE): 177624.62
Root Mean Squared Error (RMSE): 421.46
Mean Absolute Error (MAE): 266.63

Model Evaluation¶

  • Train-Test Validation
In [811]:
# Split data into train and test sets (80-20 split)
train_size = int(len(y) * 0.8)
train_y, test_y = y.iloc[:train_size], y.iloc[train_size:]
train_exo, test_exo = exo.iloc[:train_size, :], exo.iloc[train_size:, :]

# Refit SARIMAX model on training data
sarimax_model_train = SARIMAX(train_y, 
                              exog=train_exo, 
                              order=optimal_order, 
                              seasonal_order=optimal_seasonal_order, 
                              enforce_stationarity=False, 
                              enforce_invertibility=False)

sarimax_results_train = sarimax_model_train.fit()

# Forecast on test set
CA_4_predictions = sarimax_results_train.forecast(steps=len(test_y), exog=test_exo)
  • Check Distribution of MAPE
In [812]:
# Extract the actual and forecast values from the DataFrame
actual_values = test_y.values
forecast_values = CA_4_predictions

# Ensure both are one-dimensional arrays
actual_values = np.squeeze(actual_values)
forecast_values = np.squeeze(forecast_values)

# Calculate the MAPE for each row
mape_row_by_row = np.abs((actual_values - forecast_values) / actual_values) * 100

# Create a DataFrame to display the row-by-row MAPE along with the actual and forecast values
df_results = pd.DataFrame({
    'Actual': actual_values,
    'Forecast': forecast_values,
    'MAPE': mape_row_by_row
})

# Calculate the overall weighted MAPE
weights = actual_values / actual_values.sum()
overall_mape = np.sum(mape_row_by_row * weights)

# Display the DataFrame
print(df_results)

# Print the overall MAPE
print(f"Overall MAPE: {overall_mape:.2f}%")
               Actual     Forecast       MAPE
2015-03-14  2954.2010  2716.140032   8.058388
2015-03-15  3434.8767  3018.978510  12.108097
2015-03-16  2702.6387  2670.007773   1.207373
2015-03-17  2529.3633  2539.704017   0.408827
2015-03-18  2360.2935  2541.756934   7.688172
...               ...          ...        ...
2015-12-27  1891.2354  2714.121739  43.510519
2015-12-28  2110.1110  2714.346296  28.635237
2015-12-29  1747.9710  2714.570874  55.298393
2015-12-30  1981.7717  2714.795473  36.988306
2015-12-31  2814.8704  2715.020093   3.547244

[293 rows x 3 columns]
Overall MAPE: 12.87%
  • Check Overestimate/Underestimate Forecasting & Predicted/Actual Pattern Match
In [813]:
# Plot predicted results
plt.figure(figsize=(12, 6))
plt.plot(test_y.index, test_y, label="Actual Revenue", color='blue', linewidth=2)
plt.plot(test_y.index, CA_4_predictions, label="Predicted Revenue", color='red', linestyle="dashed", linewidth=2)
plt.xlabel("")
plt.ylabel("Revenue")
plt.title("CA_4 Store Revenue Forecasting (2015/3-2015/12) with 5 Major Events")
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image
  • Check Evaluation Metrics
In [814]:
# Compute evaluation metrics
mse = mean_squared_error(test_y, CA_4_predictions)
rmse = np.sqrt(mean_squared_error(test_y, CA_4_predictions))
mae = mean_absolute_error(test_y, CA_4_predictions)
mape = np.mean(np.abs((test_y - CA_4_predictions) / test_y)) * 100

# Display results
evaluation_results = (
    f"Mean Squared Error (MSE): {mse:.2f}\n"
    f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
    f"Mean Absolute Error (MAE): {mae:.2f}\n"
    f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)

print(evaluation_results)
Mean Squared Error (MSE): 224556.91
Root Mean Squared Error (RMSE): 473.87
Mean Absolute Error (MAE): 339.80
Mean Absolute Percentage Error (MAPE): inf%
In [815]:
mape_value = mean_absolute_percentage_error(
    test_y,
    CA_4_predictions
) * 100  # sklearn version returns a decimal, so multiply by 100

print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 4171090284982807040.00%

Comparison of CA_4 Forecast Plots with Evaluation Metrics¶

CA_4 Top_Down Forecast CA_4 Direct Forecast
No description has been provided for this image
MSE: 193354
RMSE: 440
MAE: 291
MAPE: infinite
No description has been provided for this image
MSE: 224557
RMSE: 474
MAE: 340
MAPE: infinite

Conclusion: Top-down approach is recommended in CA_4 Store Revenue Forecast

4. Bottom-Up Approach¶

In [ ]:
# # Align them in a single DataFrame
# df_forecasts = pd.DataFrame({
#     'CA_1': CA_1_predictions,
#     'CA_2': CA_2_predictions,
#     'CA_3': CA_3_predictions,
#     'CA_4': CA_4_predictions
# })

# # Sum across the columns to get the state forecast
# df_forecasts['State_Forecast'] = df_forecasts.sum(axis=1)

# # Now you have a time series for the state-level revenue
# state_forecast = df_forecasts['State_Forecast']
  • Calculate Foods3 Predicted Revenue
In [819]:
# Get Foods3 department revenue forecast
forecast_Foods3 = CA_1_predictions + CA_2_predictions + CA_3_predictions + CA_4_predictions
In [821]:
forecast_Foods3
Out[821]:
2015-03-14    20265.332966
2015-03-15    22190.123335
2015-03-16    17192.977805
2015-03-17    15880.629181
2015-03-18    14969.381244
                  ...     
2015-12-27    14760.474069
2015-12-28    12691.721352
2015-12-29    12209.749582
2015-12-30    11789.361566
2015-12-31    12041.423372
Freq: D, Name: predicted_mean, Length: 293, dtype: float64
  • Get Foods3 Actual Revenue
In [822]:
# 1. Copy for Foods3 department
department_daily.copy()

# 2. Restrict to desired date range (2015-03-14 to 2015-12-31)
start_date = '2015-03-14'
end_date   = '2015-12-31'
mask = (department_daily['date'] >= start_date) & (department_daily['date'] <= end_date)
Foods3_filtered = department_daily.loc[mask]

# 3. Select only 'date' and 'revenue', then sort by date
actual_Foods3 = Foods3_filtered[['date', 'revenue']].sort_values('date')

# 4. Set date column as index
actual_Foods3.set_index('date', inplace=True)

# 5. Print time series of actual revenue
actual_Foods3
Out[822]:
revenue
date
2015-03-14 20577.9210
2015-03-15 24262.6775
2015-03-16 16995.1786
2015-03-17 14918.4433
2015-03-18 13953.3844
... ...
2015-12-27 16427.9190
2015-12-28 15311.9904
2015-12-29 14675.1821
2015-12-30 16201.4307
2015-12-31 20203.9008

293 rows × 1 columns

  • Plot Foods3 Actual VS Forecast
In [823]:
# Plot actual VS forecast
plt.figure(figsize=(10, 6))

plt.plot(actual_Foods3.index, actual_Foods3, label='Actual Foods3 Revenue')
plt.plot(forecast_Foods3.index, forecast_Foods3, label='Forecast Foods3 Revenue')

# Add labels and legend
plt.xlabel('')
plt.ylabel('Revenue')
plt.title('State Level (CA) Foods3 Department Revenue (2015/3-2015/12) Forecasting with 5 Major Events')
plt.legend()

# Show the plot
plt.show()
No description has been provided for this image
  • Evaluate Foods3 Bottom-Up Forecast Result
In [824]:
# Compute evaluation metrics
mse = mean_squared_error(actual_Foods3, forecast_Foods3)
rmse = np.sqrt(mse)
mae = mean_absolute_error(actual_Foods3, forecast_Foods3)
mape = np.mean(np.abs((actual_Foods3 - forecast_Foods3) / actual_Foods3)) * 100

# Display results
evaluation_results = (
    f"Mean Squared Error (MSE): {mse:.2f}\n"
    f"Root Mean Squared Error (RMSE): {rmse:.2f}\n"
    f"Mean Absolute Error (MAE): {mae:.2f}\n"
    f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%"
)

print(evaluation_results)
Mean Squared Error (MSE): 17542113.09
Root Mean Squared Error (RMSE): 4188.33
Mean Absolute Error (MAE): 3363.54
Mean Absolute Percentage Error (MAPE): nan%
In [825]:
mape_value = mean_absolute_percentage_error(
    actual_Foods3,
    forecast_Foods3
) * 100  # sklearn version returns a decimal, so multiply by 100

print(f"MAPE (sklearn): {mape_value:.2f}%")
MAPE (sklearn): 317.61%

Comparison of Foods3 Forecast Plots with Evaluation Metrics¶

Foods3 Bottom_Up Forecast Foods3 Direct Forecast
No description has been provided for this image
MSE: 17542113
RMSE: 4188
MAE: 3364
MAPE: 318%
No description has been provided for this image
MSE: 6068120
RMSE: 2463
MAE: 1553
MAPE: 526%

Conclusion: Bottom-up approach is not recommended in Foods3 Department Revenue Forecast

Summary¶

Store/Department Forecast Recommendation
Foods3 Department Directly Forecast at Department Level is Recommended
CA_1 Store Forecast at Departmental Level & Top-Down Approach is Recommended
CA_2 Store Forecast with ML & DL Model is Recommended
CA_3 Store Forecast at Departmental Level & Top-Down Approach is Recommended
CA_4 Store Forecast at Departmental Level & Top-Down Approach is Recommended

Directly Forecast Department Revenue and Using Top-Down Approach to Get Each Store Revenue is Highly Recommended in This Business Case

Why?¶

Why Top-down Approach is Recommended:¶

  • Capture the 55% seasonal strength and reduce noise like zero-revenue days and daily fluctuation like CA_2 in individual store data

  • It is using one model:

Save computational time and cost

Easier to maintain and debug

Lower likelihood of overfitting

Why Bottom-up Approach is Not Recommended:¶

  • Lead to misaligned inventory or staffing decisions if store-specific errors compound at the department level

  • It is using many models:

Cost abundant resources and lower speed

Difficult to maintain and deploy in production

Hard to explain to clients

Why Hybrid Approach with GBRT, LSTM, and SARIMAX is Better:¶

  • Given our data’s characteristics:

55% strong seasonality

4 zero-revenue days

5 event exogenous variables

Overall non-linear trends

In the above situation, ML model (GBRT) or DL model (LSTM) could reduce errors and improve accuracy; However, SARIMAX is preferred for easy interpretability and faster speed to get store forecasts with top-down approach;

Therefore, a hybrid approach that combines the strengths of all 3 methods could be the most effective solution

In [ ]: